When deploying R code across a cluster, a common reason for task failure is that a required package is not available on all nodes of the cluster. You could wait for someone to install the package across every node, which may take days. Do we wait for them? Nah!
Presenting a temporary workaround that one of my colleagues came up with. I have used this technique myself and it works smoothly.
The following are the steps:
1. Install the package you require on one of the edge nodes into a local directory
Create a local directory. Let's say our directory name is rPackages
mkdir rPackages
Then install the required package, say 'randomForest', into the directory just created
install.packages('randomForest', repos='repo_name', lib='rPackages/')
Note that you need to substitute the appropriate repo_name — the repository your company allows.
2. Check if you can load the package from this local directory
library(randomForest, lib.loc='rPackages/')
3. Create a zip file of the rPackages directory using the command
zip -r rPackages.zip rPackages/
4. Add this zip file in your Hive HQL script (or anything else that ships files to the task nodes)
add file rPackages.zip;
Don't forget the semicolon.
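To see where this fits, here is a minimal sketch of a Hive script that ships the zip alongside an R reducer and invokes it via TRANSFORM. The script name, table, and columns are hypothetical — adapt them to your job:

```sql
-- Ship the packaged R library and the reducer script to the task nodes
add file rPackages.zip;
add file my_reducer.R;  -- hypothetical R script name

-- Stream rows through the R script; columns and table are placeholders
SELECT TRANSFORM (key, value)
  USING 'Rscript my_reducer.R'
  AS (key STRING, result STRING)
FROM my_table;
```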
5. Unzip the file inside the R script, i.e. each reducer will now have its own rPackages directory
unzip('rPackages.zip', overwrite=TRUE)
6. Load the package now
library(randomForest, lib.loc='rPackages/')
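Putting steps 5 and 6 together, a minimal sketch of the R reducer script might look like this (the stdin-processing loop is an assumption about how your job reads Hive's tab-separated rows):

```r
# Sketch of an R reducer script. Assumes rPackages.zip was shipped with
# `add file`, so it sits in the task's working directory.

# Step 5: unzip the shipped library into the reducer's working directory
unzip('rPackages.zip', overwrite = TRUE)

# Step 6: load the package from the unzipped local library
library(randomForest, lib.loc = 'rPackages/')

# Read tab-separated rows from Hive on stdin, emit results on stdout
con <- file('stdin', open = 'r')
while (length(line <- readLines(con, n = 1)) > 0) {
  fields <- strsplit(line, '\t')[[1]]
  # ... apply your model / do the actual work here ...
  cat(fields, sep = '\t')
  cat('\n')
}
close(con)
```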
And you're done! Remember, you have to build the package on the same OS as the nodes where you want to use it, because built packages are OS-dependent.
Did you find the article useful? If you did, share your thoughts in the comments.