Manish Barnwal

...just another human

Brief introduction to SparkUI

Recently, I've been working a lot with PySpark in AWS EMR. I have a huge data dump (~300 million users) that I needed to process and transform it into the right format for further processing. The data is in a S3 bucket in AWS and I use AWS EMR ...

Diving into H2O with R

Do you understand the pain when you have to train advanced machine learning algorithms like Random Forest on huge datasets? When there is a factor column that has way too many number of levels? When the time taken to train the model is so huge that you went to your ...

Hadoop Streaming

A few days ago, I had written a post on The Big Data Problem which attempted to understand why we need big data and what the fuss is all about. You may want to read it here.

Having understood why we need big data, let’s understand how we can ...

The Big Data Problem

Big data has become a sensation these days. Anyone and everyone wants to use this in their discussions. When I was still in my college and preparing for campus placements, I had attended almost all the pre-placement talks that companies gave to its prospective candidates.

American Express was one such ...

Understanding HIVE for data science people

I have been working as Statistical analyst for the last 1.5 years and fortunately I got to work on Hadoop on one of my initial projects. Hadoop sounds scary to a lot of people and I am no exception. In this post, I would make an attempt to explain ...