Brief introduction to SparkUI

Aug 31, 2020

Recently, I've been working a lot with PySpark in AWS EMR. I have a huge data dump (~300 million users) that I needed to process and transform into the right format for further processing. The data is in an S3 bucket in AWS and I use AWS EMR ...

Diving into H2O with R

Mar 28, 2017

Do you know the pain of training advanced machine learning algorithms like Random Forest on huge datasets? When a factor column has far too many levels? When the model takes so long to train that you went to your ...

Hadoop Streaming

Aug 29, 2016

A few days ago, I wrote a post on The Big Data Problem, which tried to explain why we need big data and what the fuss is all about. You may want to read it here.

Having understood why we need big data, let’s understand how we can ...

The Big Data Problem

Jun 29, 2016

Big data has become a sensation these days. Anyone and everyone wants to bring it up in their discussions. When I was still in college and preparing for campus placements, I attended almost all the pre-placement talks that companies gave to their prospective candidates.

American Express was one such ...

Understanding HIVE for data science people

Jun 22, 2016

I have been working as a statistical analyst for the last 1.5 years, and fortunately I got to work on Hadoop in one of my initial projects. Hadoop sounds scary to a lot of people, and I am no exception. In this post, I will attempt to explain ...

Manish Barnwal

...just another human
