Manish Barnwal

...just another human

Brief introduction to SparkUI

Recently, I've been working a lot with PySpark in AWS EMR. I have a huge data dump (~300 million users) that I needed to process and transform into the right format for further processing. The data is in an S3 bucket in AWS and I use AWS EMR ...

Why you should use logging instead of print statements?

It is a common but wrong practice to add print statements inside your code to convey messages about the code/function on the standard output - at least that is what I used to do till last year. Last year, I got to know about logging and got to understand its ...
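As a rough illustration of the difference, here is a minimal sketch using Python's standard `logging` module. The logger name `demo` and the function `safe_divide` are hypothetical, made up for this example; they are not taken from the post.

```python
import logging

# "demo" is a hypothetical logger name used only for this sketch.
logger = logging.getLogger("demo")


def safe_divide(a, b):
    """Divide a by b, logging what happens instead of printing it."""
    logger.debug("dividing %s by %s", a, b)
    if b == 0:
        # A warning carries a severity level, which a bare print() cannot.
        logger.warning("division by zero, returning None")
        return None
    return a / b


if __name__ == "__main__":
    # Configure output once, at the entry point; the level and format
    # can be changed here without touching the function above.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    safe_divide(1, 0)
```

With the level set to `INFO`, the debug message is suppressed and only the warning is shown; switching to `logging.DEBUG` shows both without editing the function.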

Personal finance 101

I am not going to talk about the power of compounding, why money is important, how you can make your money work for you, or any similar-sounding cliched titles. If you are reading this blog, I assume you already understand that money is important and you would love to ...

Cluster NSE top 500 companies

Idea

Each company can be represented by a few metrics that define the health of the company: EPS, revenue, market cap, last 4 quarter earnings, RoE, PE ratio, etc. There could be other metrics or features as well. If we represent each company with these metrics and apply clustering ...

Types of data in recommender systems

There are two ways in which we can collect data for building recommender systems — explicit and implicit. In this post, we will talk about both types of data, their characteristics and the challenges with them.

Explicit feedback datasets

The dictionary meaning of explicit is to state clearly and in detail ...

Handling errors with try-catch in Python

In the previous post, I discussed how to convert a string to date format in Python. I was working on a similar idea today. I had a column of object type that held dates as strings. The column name is 'signed_up_at' and I wanted to convert it to date format ...
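A minimal sketch of the idea, using only the standard library: wrap the conversion in `try`/`except` so that a bad value does not stop the whole run. The helper `parse_date`, the `%Y-%m-%d` format, and the sample values are assumptions made for illustration; the excerpt does not say what format the 'signed_up_at' strings use.

```python
from datetime import date, datetime


def parse_date(text, fmt="%Y-%m-%d"):
    """Return a date parsed from text, or None if it cannot be parsed.

    The default format is an assumption for this sketch.
    """
    try:
        return datetime.strptime(text, fmt).date()
    except (ValueError, TypeError):
        # ValueError: the string does not match the format.
        # TypeError: the value is not a string at all (e.g. None).
        return None


# Hypothetical sample values standing in for the 'signed_up_at' column.
signed_up_at = ["2017-03-01", "not a date", None]
parsed = [parse_date(value) for value in signed_up_at]
```

Catching only `ValueError` and `TypeError`, rather than using a bare `except`, keeps unrelated bugs visible.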

Working with dates in Python

I cringe every time I see a date-type column in the data, and you may ask why. Date columns need some methods applied to them.

The reason is I don't normally see date columns in the data I work with so I don't remember the functions ...

git and github for data scientists

It has been close to a year since I shifted to a start-up, which incidentally got acquired a month after I joined. Before this, I used to work at WalmartLabs, where we always wanted to use a version control system like git but it never took off properly. Now ...

Creating a virtual environment in Python

I was trying to get a virtual environment set up on Python 3 using mkvirtualenv but somehow the virtual environment was getting created on Python 2.7 (my system python).

If you already know about virtual environments and why they are useful, you may skip the next two paragraphs. I ...