Manish Barnwal

...just another human

When an R package is not available across the cluster

When deploying R code across a cluster, a task often fails because a particular package is not installed on every node of the cluster. We then wait for someone to get the package installed across all the nodes, which may take some days. Do we wait for them? Naah!

Presenting a temporary solution that one of my colleagues came up with. I have used this technique and it works smoothly.

The following are the steps:

  1. Install the package you require on one of the edge nodes into a local directory
    • Create a local directory. Let's say our directory name is rPackages
          mkdir rPackages
          
    • Install the required package, say 'randomForest', in the directory just created
          install.packages('randomForest', repos='repo_name', lib='rPackages/')
      Note that repo_name is a placeholder: use the repository that your company allows.
          


2. Check if you can load the package from this local directory

library(randomForest, lib.loc='rPackages/')


3. Create a zip file of the rPackages directory using the command

zip -r rPackages.zip rPackages/


4. Add this zip file in your Hive HQL script (or whatever else you use to submit the job)

add file rPackages.zip;
Don’t forget the semicolon.


5. Unzip the file inside your R script, so that each reducer now has its own rPackages directory

unzip('rPackages.zip', overwrite=TRUE)


6. Load the package now

library(randomForest, lib.loc='rPackages/')
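
Steps 5 and 6 can be wrapped in a small helper at the top of the R script. What follows is a minimal sketch, not part of the original recipe. The function name load_shipped_package and its arguments are hypothetical. It assumes the archive ends up in the task's working directory under the names used in the steps above. It stops with an explicit message when the archive or the package is missing, which tends to be easier to spot in reducer logs than a bare library() error.

```r
# Hypothetical helper (not from the original post): unpack a shipped
# library archive and attach one package from it.
load_shipped_package <- function(pkg,
                                 zip_file = "rPackages.zip",
                                 lib_dir = "rPackages") {
  # Assumption: the archive sits in the current working directory.
  if (!file.exists(zip_file)) {
    stop("archive not found in working directory: ", zip_file)
  }

  # Step 5: unpack; this recreates lib_dir next to the script.
  utils::unzip(zip_file, overwrite = TRUE)

  # Look only inside lib_dir, ignoring any system-wide libraries.
  if (length(find.package(pkg, lib.loc = lib_dir, quiet = TRUE)) == 0) {
    stop("package '", pkg, "' not found in ", lib_dir)
  }

  # Step 6: attach the package from the unpacked directory.
  library(pkg, lib.loc = lib_dir, character.only = TRUE)
  invisible(TRUE)
}

# Usage inside the reducer script:
# load_shipped_package("randomForest")
```

The find.package() check is restricted to lib_dir on purpose, so a copy of the package that happens to exist elsewhere on one node does not hide a broken archive.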


And you’re done! Remember that you have to build the package on the same kind of system where you want to use it, because built packages are OS dependent.
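
One way to check this before shipping the archive is to compare the platform a package was built for with the platform R is running on. The sketch below is an illustration, not part of the original steps. built_info is a hypothetical helper that reads the Built field of the installed package's DESCRIPTION file, which R normally writes as "R version; platform; timestamp; OS type". Packages without compiled code typically leave the platform part empty.

```r
# Hypothetical helper: report what an installed package was built for.
built_info <- function(pkg, lib_dir = NULL) {
  # With a single field, packageDescription() returns a plain string
  # (NA if the package or the field cannot be found).
  built <- utils::packageDescription(pkg, lib.loc = lib_dir, fields = "Built")
  if (is.na(built)) {
    stop("no Built field found for package: ", pkg)
  }
  # Split "R x.y.z; platform; timestamp; os_type" into its parts.
  parts <- trimws(strsplit(built, ";", fixed = TRUE)[[1]])
  list(r_version = parts[1], platform = parts[2], os_type = parts[4])
}

# "stats" ships with R, so this demo runs without extra installs.
info <- built_info("stats")
print(info)

# The platform of the R session running this script, for comparison.
print(R.version$platform)
```

On the edge node, built_info('randomForest', lib_dir = 'rPackages') would show what the shipped copy targets, and R.version$platform on a worker node shows what it has to run on. A non-empty platform that differs from the worker's is a hint, not proof, that the package should be rebuilt on a matching machine.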

Did you find the article useful? If you did, share your thoughts in the comments.

