Difference in memory usage between gbm and blackboost

I am working with a dataset of about 250,000 observations and 50 predictors (some are factors, so in the end there are about 100 features), and I have a problem using the blackboost() function (from the mboost package): it gives me a memory allocation error.

At the same time, gbm() has no problem with this amount of data. According to the documentation, the algorithm used by blackboost is the same as in gbm (http://cran.r-project.org/web/packages/mboost/mboost.pdf).

It is not clear to me why one function can handle the dataset and the other cannot. My guesses:

gbm has a subsampling strategy (set through the bag.fraction argument), which does not seem to be implemented in blackboost and which affects memory usage.
gbm uses CART-like functions to build its trees, whereas blackboost uses ctree, which seems to have huge memory usage (see "How to remove training data from party:::ctree models?").
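To make the first guess concrete, here is a minimal sketch of drawing a single random subset of rows before fitting. It is only a crude stand-in for bag.fraction: gbm draws a fresh subsample at every boosting iteration, whereas this draws one subsample up front. The data frame dat, its columns, and the value of bag_fraction are hypothetical placeholders, and the blackboost call is left commented out.

```r
set.seed(1)

# Hypothetical stand-in for the real data (250,000 rows)
n <- 250000
dat <- data.frame(
  y  = rbinom(n, 1, 0.4),
  x1 = rnorm(n),
  x2 = rnorm(n)
)

# Keep a random half of the rows, sampled without replacement
bag_fraction <- 0.5
idx <- sample.int(n, size = floor(bag_fraction * n))
dat_sub <- dat[idx, , drop = FALSE]

nrow(dat_sub)

# The smaller data frame could then be passed to the fit (not run here):
# fit <- mboost::blackboost(y ~ ., data = dat_sub)
```

Fitting on fewer rows lowers the memory needed for the stored model frame and trees, at the cost of discarding data, so it is a workaround rather than an equivalent of gbm's per-iteration subsampling.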

I want to use the AUC() loss function, which is available in mboost but not in gbm, so I would be interested in any suggestions for overcoming blackboost's memory limitations.

An additional question: when I try to reduce the number of variables in my model, I get this new error from blackboost:

Error in matrix(f[ind1], nrow = n0, ncol = n1, byrow = TRUE) : data length [107324] is not a sub-multiple or multiple of the number of rows [152107]

It seems to come from the AUC gradient function.
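The call shown in the error message, matrix(f[ind1], nrow = n0, ncol = n1, ...), at least reveals the shape of the matrix being requested. As a rough back-of-the-envelope check, assume the two numbers in the message are the row count n0 and the length of f[ind1], and that this length equals n1. That mapping is an assumption read off the message, not something confirmed from the mboost source. Under it, a dense matrix of that shape would be far too large to hold in memory:

```r
n0 <- 152107   # number of rows reported in the error message
n1 <- 107324   # length of f[ind1] reported in the message (assumed to equal ncol)

cells   <- n0 * n1            # entries in a dense n0 x n1 matrix
size_gb <- cells * 8 / 1e9    # doubles take 8 bytes each

size_gb                       # roughly 130 GB
```

If that reading is right, the memory cost of the AUC gradient grows with the product of the two class sizes, so dropping predictors would not help, while reducing the number of rows would.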

Thanks for your help.


2014-04-18 Alex

You are right that ctree is one of the causes. The script below illustrates the issue. You can reduce the memory requirements somewhat by setting tree_controls = party::ctree_control(..., remove_weights = TRUE), as I show. However, as far as I know, you cannot avoid the additional stored data.frame and the other sources of memory use.

Here is an example:

# Load data and set options 
options(digits = 4) 
data("BostonHousing", package = "mlbench") 

# Size of the training data
object.size(BostonHousing)/10^6 # in MB 
#> 0.1 bytes 

# blackboost and mboost store a ctree-like structure, not on the object itself
# but in an environment in the background. These can be big!
# First, we use some of the default settings
ctrl_lrg_mem <- party::ctree_control(
    teststat = "max", 
    testtype = "Teststatistic", 
    mincriterion = 0, 
    maxdepth = 3, 
    stump = FALSE, 
    minbucket = 20, 
    savesplitstats = FALSE, # Default w/ mboost 
    remove_weights = FALSE) # Default w/ mboost 

gc() # shows memory usage before 
#>   used (Mb) gc trigger (Mb) max used (Mb) 
#> Ncells 2467924 131.9 3886542 207.6 3886542 207.6 
#> Vcells 4553719 34.8 14341338 109.5 22408297 171.0 
fit1 <- mboost::blackboost(
    medv ~ ., data = BostonHousing,
    tree_controls = ctrl_lrg_mem,
    control = mboost::boost_control(mstop = 100))
gc() # shows memory usage after 
#>   used (Mb) gc trigger (Mb) max used (Mb) 
#> Ncells 2494735 133.3 3886542 207.6 3886542 207.6 
#> Vcells 5608368 42.8 14341338 109.5 22408297 171.0 

# It is not the object itself that requires a lot of memory
object.size(fit1)/10^6 
#> 1.3 bytes 

# It is the objects stored in the environments in the background
tmp_env <- environment(fit1$predict) 
length(tmp_env$ens) # The boosted trees 
#> [1] 100 
sum(unlist(lapply(tmp_env$ens, object.size)))/10^6 
#> [1] 7.312 

# Moreover, there is also a model frame for the data stored in the baselearner 
# function's environment which takes some space 
env <- environment(fit1$basemodel[[1]]$fit) 
str(env$df) # data frame of initial data 
#> 'data.frame': 506 obs. of 14 variables: 
#> $ crim      : num 0.00632 0.02731 0.02729 0.03237 0.06905 ... 
#> $ zn      : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ... 
#> $ indus     : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ... 
#> $ chas      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ... 
#> $ nox      : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ... 
#> $ rm      : num 6.58 6.42 7.18 7 7.15 ... 
#> $ age      : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ... 
#> $ dis      : num 4.09 4.97 4.97 6.06 6.06 ... 
#> $ rad      : num 1 2 2 3 3 3 5 5 5 5 ... 
#> $ tax      : num 296 242 242 222 222 222 311 311 311 311 ... 
#> $ ptratio     : num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ... 
#> $ b      : num 397 397 393 395 397 ... 
#> $ lstat     : num 4.98 9.14 4.03 2.94 5.33 ... 
#> $ WLKJDJDQYBTDQCZDNHZMPZNCS: num 0 0 0 0 0 0 0 0 0 0 ... 
object.size(env$df)/10^6 
#> 0.1 bytes 
# str(env$object) # output excluded for space reasons 
object.size(env$object)/10^6 
#> 0.8 bytes 

# The above implies that if your data is 1 GB, then the fit will require 1 GB
# as well, as far as I gather

# We can, however, reduce the memory requirements
ctrl_sml_mem <- party::ctree_control(
    teststat = "max", 
    testtype = "Teststatistic", 
    mincriterion = 0, 
    maxdepth = 3, 
    stump = FALSE, 
    minbucket = 20, 
    savesplitstats = FALSE, 
    remove_weights = TRUE) # Changed 

gc() 
#>   used (Mb) gc trigger (Mb) max used (Mb) 
#> Ncells 2494810 133.3 3886542 207.6 3886542 207.6 
#> Vcells 5608406 42.8 14341338 109.5 22408297 171.0 
fit2 <- mboost::blackboost(
    medv ~ ., data = BostonHousing,
    tree_controls = ctrl_sml_mem,
    control = mboost::boost_control(mstop = 100))
gc() 
#>   used (Mb) gc trigger (Mb) max used (Mb) 
#> Ncells 2520425 134.7 3886542 207.6 3886542 207.6 
#> Vcells 6081411 46.4 14341338 109.5 22408297 171.0 

# Reduces the size of the objects in the back 
tmp_env <- environment(fit2$predict) 
length(tmp_env$ens) # The boosted trees 
#> [1] 100 
sum(unlist(lapply(tmp_env$ens, object.size)))/10^6 
#> [1] 2.611 

##### 
# The version I run 
sessionInfo(package = c("party", "mboost")) 
#> R version 3.4.0 (2017-04-21) 
#> Platform: x86_64-w64-mingw32/x64 (64-bit) 
#> Running under: Windows >= 8 x64 (build 9200) 
#> 
#> Matrix products: default 
#> 
#> locale: 
#> [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 
#> [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C       
#> [5] LC_TIME=English_United Kingdom.1252  
#> 
#> attached base packages: 
#> character(0) 
#> 
#> other attached packages: 
#> [1] party_1.2-3 mboost_2.8-0 
#> 
#> loaded via a namespace (and not attached): 
#> [1] Rcpp_0.12.11  compiler_3.4.0  formatR_1.4   git2r_0.18.0  R.methodsS3_1.7.1 
#> [6] methods_3.4.0  R.utils_2.5.0  utils_3.4.0   tools_3.4.0   grDevices_3.4.0  
#> [11] boot_1.3-19   digest_0.6.12  jsonlite_1.4  memoise_1.1.0  R.cache_0.12.0  
#> [16] lattice_0.20-35  Matrix_1.2-9  shiny_1.0.2   parallel_3.4.0  curl_2.5   
#> [21] mvtnorm_1.0-6  speedglm_0.3-2  coin_1.1-3   R.rsp_0.41.0  withr_1.0.2   
#> [26] httr_1.2.1   stringr_1.2.0  knitr_1.15.1  stabs_0.6-2   graphics_3.4.0  
#> [31] datasets_3.4.0  stats_3.4.0   devtools_1.12.0  stats4_3.4.0  dynamichazard_0.3.0 
#> [36] grid_3.4.0   base_3.4.0   data.table_1.10.4 R6_2.2.0   survival_2.41-2  
#> [41] multcomp_1.4-6  TH.data_1.0-8  magrittr_1.5  nnls_1.4   codetools_0.2-15 
#> [46] modeltools_0.2-21 htmltools_0.3.6  splines_3.4.0  MASS_7.3-47   rsconnect_0.7  
#> [51] strucchange_1.5-1 mime_0.5   xtable_1.8-2  httpuv_1.3.3  quadprog_1.5-5  
#> [56] sandwich_2.3-4  stringi_1.1.5  zoo_1.8-0   R.oo_1.21.0
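
The reason object.size(fit1) looks small even though the fit holds on to a lot of data can be reproduced with plain closures in base R: object.size does not follow environments, so anything captured in a function's enclosing environment is not counted. The sketch below is only an illustration of that mechanism; make_fit and big are hypothetical stand-ins, not part of mboost or party.

```r
# A toy "fit" whose predict function captures a large vector in its
# enclosing environment, much like the stored trees and model frame above
make_fit <- function(n) {
  big <- numeric(n)                       # stands in for the hidden data
  list(predict = function() length(big))  # closure that keeps `big` alive
}

fit <- make_fit(1e6)

# The list itself looks tiny because the environment is not measured
size_obj <- as.numeric(object.size(fit))

# The captured vector is where the memory actually goes (about 8 MB)
size_env <- as.numeric(object.size(environment(fit$predict)$big))

c(object = size_obj, environment = size_env) / 10^6  # in MB
```

This is why the script inspects environment(fit1$predict) and environment(fit1$basemodel[[1]]$fit) directly rather than relying on object.size(fit1).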


2017-05-31 08:17:09
