Shell commands come in handy for a data scientist

I am no expert in shell commands, but I have been using them for quite some time, so I thought I would attempt to list the most common ones. I am writing mostly from the perspective of a data scientist. Let us get started.

I will use the file ‘data.txt’ to illustrate these commands. ‘data.txt’ has 200 rows and 8 columns. You can access the data here.

cat Prints the contents of the entire file to your terminal

    cat data.txt

Usually we don’t want to flood the terminal with the complete content of the file. If you want to look through the whole file, open it in the vim editor instead.

vim Opens the file in the editor

    vim data.txt

head Gives the top 10 rows of the text file at your terminal

    head data.txt

tail Gives the bottom 10 rows of the text file at your terminal

    tail -n 2 data.txt    # prints the bottom 2 rows of the file
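Both head and tail accept -n to choose how many rows to print. Here is a minimal sketch of the common variants. The file sample.txt and its contents are made up for illustration and are not the real data.txt, and the header line is an assumption.

```shell
# Hypothetical stand-in for data.txt: a header line plus five comma-separated rows.
workdir=$(mktemp -d)
sample="$workdir/sample.txt"
printf '%s\n' \
  'id,mood,score' \
  '1,Very Happy,90' \
  '2,Not Happy,35' \
  '3,Happy,70' \
  '4,Very Happy,85' \
  '5,Not Happy,40' > "$sample"

first_three=$(head -n 3 "$sample")      # header plus the first two rows
last_two=$(tail -n 2 "$sample")         # the final two rows
without_header=$(tail -n +2 "$sample")  # everything from line 2 onward

printf '%s\n' "$first_three"
rm -rf "$workdir"
```

The `tail -n +2` form is handy when a file has a header row that you want to skip before further processing.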

The piping operator

    cat data.txt | head

Notice the | operator. This is called the pipe operator. Piping lets you perform a sequence of operations in a single command.

So what exactly is piping?

A pipe is a facility of the shell that makes it easy to chain together multiple commands. When used between two Unix commands, it means that the output of the first command becomes the input of the second command. Just read the | in the command as ‘pass the data on to’. More on this operator later in the post.
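To make the idea concrete, here is a small sketch that chains three commands to pick out exactly one line. The file sample.txt and its rows are hypothetical, created only for this example.

```shell
# Hypothetical four-row sample, not the real data.txt.
workdir=$(mktemp -d)
sample="$workdir/sample.txt"
printf '%s\n' '1,Very Happy,90' '2,Not Happy,35' '3,Happy,70' '4,Very Happy,85' > "$sample"

# Three stages: cat emits the file, head keeps lines 1-3, tail keeps the last of those.
third_line=$(cat "$sample" | head -n 3 | tail -n 1)

# The leading cat is optional, since head can read the file itself.
same_line=$(head -n 3 "$sample" | tail -n 1)

printf '%s\n' "$third_line"   # prints: 3,Happy,70
rm -rf "$workdir"
```

Each stage only sees what the previous one printed, which is why the order of the commands matters.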

wc

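wc is a fairly useful shell command that lets you count the number of lines (-l), words (-w) or bytes (-c) in a given file. For plain ASCII text the byte count equals the character count; use -m to count characters when the file contains multi-byte characters.

    wc -l data.txt    # number of lines in the file
    wc -w data.txt    # number of words in the file
    wc -c data.txt    # number of bytes in the file

    head -n 1 data.txt | wc -w    # number of columns, provided the columns are separated by whitespace

Note that wc -w counts whitespace-separated words, so for a comma-delimited file it does not give the number of columns.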
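For a comma-delimited file, one way to count columns is to turn every comma in the first row into a newline and count lines instead of words. This is a sketch on a made-up three-column file, sample.txt, and it assumes no field contains a comma of its own.

```shell
# Hypothetical sample: a header and two rows, comma-delimited.
workdir=$(mktemp -d)
sample="$workdir/sample.txt"
printf '%s\n' 'id,mood,score' '1,Very Happy,90' '2,Not Happy,35' > "$sample"

# wc -w sees the comma-separated header as a single word.
# (tr -d ' ' strips the padding that some wc implementations print.)
words_in_header=$(head -n 1 "$sample" | wc -w | tr -d ' ')

# Turn each comma into a newline, then count lines to get the column count.
columns=$(head -n 1 "$sample" | tr ',' '\n' | wc -l | tr -d ' ')

# Reading from standard input makes wc print the count without the file name.
rows=$(wc -l < "$sample" | tr -d ' ')

echo "words=$words_in_header columns=$columns rows=$rows"   # prints: words=1 columns=3 rows=3
rm -rf "$workdir"
```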

grep

Think of ‘grep’ as a filter on the lines you get. You may want to print all the lines in your file that contain a particular phrase. Say, for example, you want to see people who are ‘Very Happy’. You simply pass this phrase to the grep command.

    grep 'Very Happy' data.txt | head    # first 10 rows containing 'Very Happy'

Let us say we want to count the number of users who are ‘Not Happy’.

    grep 'Not Happy' data.txt | wc -l    # number of rows containing 'Not Happy'
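A few grep options are worth knowing alongside the plain form: -c counts matching lines directly, -i ignores case and -v inverts the match. The sketch below runs on a hypothetical five-row sample.txt. It also shows a pitfall: a pattern matches anywhere in the line, so 'Happy' alone also matches 'Very Happy' and 'Not Happy'.

```shell
# Hypothetical five-row sample, not the real data.txt.
workdir=$(mktemp -d)
sample="$workdir/sample.txt"
printf '%s\n' \
  '1,Very Happy,90' \
  '2,Not Happy,35' \
  '3,Happy,70' \
  '4,Very Happy,85' \
  '5,Not Happy,40' > "$sample"

not_happy=$(grep -c 'Not Happy' "$sample")        # -c counts matching lines: 2
any_happy=$(grep -c 'Happy' "$sample")            # also matches Very/Not Happy: 5
only_happy=$(grep -c ',Happy,' "$sample")         # the commas pin the whole field: 1
ignore_case=$(grep -c -i 'very happy' "$sample")  # -i ignores case: 2
not_very=$(grep -c -v 'Very Happy' "$sample")     # -v keeps non-matching lines: 3

echo "$not_happy $any_happy $only_happy $ignore_case $not_very"
rm -rf "$workdir"
```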

sort

Use sort to order the data by some column, say ‘Score’, which is the 3rd column in data.txt.

    sort -t ',' -k 3,3 -n -r data.txt | head -n 5    # top 5 rows by Score, highest first

Explanation: -t is used to specify the delimiter; ‘,’ in this case.

Without -t, sort splits fields on blanks (spaces or tabs), so a tab-delimited file whose values contain no spaces does not need the -t argument.

-k is used to specify the column on which you want to sort the data. -k 3,3 restricts the key to the 3rd column; a bare -k 3 uses everything from the 3rd column to the end of the line.
-n is to specify that sorting is to be done numerically
-r is to imply that the sorting is descending
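If the file has a header row, sort will move it around like any other line. One possible workaround, sketched here on a hypothetical sample.txt with a made-up header, is to print the header first and sort only the remaining lines.

```shell
# Hypothetical sample with a header; Score is the 3rd column.
workdir=$(mktemp -d)
sample="$workdir/sample.txt"
printf '%s\n' \
  'id,mood,score' \
  '1,Very Happy,90' \
  '2,Not Happy,35' \
  '3,Happy,70' \
  '4,Very Happy,85' \
  '5,Not Happy,40' > "$sample"

# Emit the header untouched, then the two highest-scoring rows.
top_two=$(
  head -n 1 "$sample"
  tail -n +2 "$sample" | sort -t ',' -k 3,3 -n -r | head -n 2
)

printf '%s\n' "$top_two"
rm -rf "$workdir"
```

With this sample the output is the header followed by the rows with scores 90 and 85.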

cut

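This command gives you only specific columns. Say you want to see only the 4th column of the file.

    cut -d ',' -f 4 data.txt | head

Explanation: -d sets the delimiter, ‘,’ here, and -f gives the column number that you want to see, 4 here. Putting the options before the file name keeps the command portable across cut implementations.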
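The -f argument also takes lists and ranges, which helps when you need more than one column. A short sketch on a hypothetical four-column sample.txt:

```shell
# Hypothetical four-column sample, not the real data.txt.
workdir=$(mktemp -d)
sample="$workdir/sample.txt"
printf '%s\n' '1,Very Happy,90,Pune' '2,Not Happy,35,Delhi' > "$sample"

pair=$(cut -d ',' -f 2,3 "$sample" | head -n 1)    # columns 2 and 3: Very Happy,90
range=$(cut -d ',' -f 1-2 "$sample" | head -n 1)   # columns 1 through 2: 1,Very Happy
rest=$(cut -d ',' -f 2- "$sample" | head -n 1)     # column 2 to the end: Very Happy,90,Pune

printf '%s\n' "$pair" "$range" "$rest"
rm -rf "$workdir"
```

Note that cut splits on every delimiter character, so it will not handle commas that appear inside quoted fields.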

uniq

Do not mistake this command for ‘unique’; it is slightly different. It removes only adjacent duplicate lines. So if you want to get the unique values of a column, you need to sort the data first and then pipe it to uniq.

To get the unique values of a column, say the 2nd column:

    cut -d ',' -f 2 data.txt | sort | uniq

With the -c argument, uniq also counts the occurrences of each distinct value, much like COUNT(*) with GROUP BY in SQL.

    cut -d ',' -f 2 data.txt | sort | uniq -c
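Adding one more sort at the end turns the counts into a frequency table with the most common value on top. The sketch below uses a hypothetical two-column sample.txt and also shows why the first sort is needed.

```shell
# Hypothetical sample: the 2nd column holds the value we want to count.
workdir=$(mktemp -d)
sample="$workdir/sample.txt"
printf '%s\n' \
  '1,Very Happy' \
  '2,Not Happy' \
  '3,Very Happy' \
  '4,Happy' \
  '5,Not Happy' \
  '6,Very Happy' > "$sample"

# Without sorting, no duplicates are adjacent, so uniq removes nothing: 6 lines.
unsorted=$(cut -d ',' -f 2 "$sample" | uniq | wc -l | tr -d ' ')

# Sorting first groups equal values together: 3 distinct values.
distinct=$(cut -d ',' -f 2 "$sample" | sort | uniq | wc -l | tr -d ' ')

# Count each value, then sort the counts numerically, largest first.
counts=$(cut -d ',' -f 2 "$sample" | sort | uniq -c | sort -n -r)

printf '%s\n' "$counts"
rm -rf "$workdir"
```

With this sample the table lists Very Happy (3), then Not Happy (2), then Happy (1); the exact padding before each count depends on the uniq implementation.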

tr

tr stands for translate. It is the counterpart of the ‘Find and Replace’ function that we have in Excel, working one character at a time. A typical use for me is converting delimiters: the files I get from Hive are tab-delimited, and I often need them ‘,’ delimited, or the other way round.

You may also want to replace certain characters in a file with something else using the tr command.

    cat data.txt | tr ',' '\t'    # changes ',' delimited to tab delimited; swap the two arguments for the reverse
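Here is a sketch of the tab-to-comma direction, plus two other everyday uses of tr. The file export.tsv is a made-up stand-in for a Hive export. Keep in mind that tr replaces every matching character, so this only works when the fields themselves contain no commas.

```shell
# Hypothetical tab-delimited export with two rows.
workdir=$(mktemp -d)
hive_export="$workdir/export.tsv"
printf '1\tVery Happy\t90\n2\tNot Happy\t35\n' > "$hive_export"

# Replace every tab with a comma; < feeds the file to tr without needing cat.
as_csv=$(tr '\t' ',' < "$hive_export")

# Character classes translate whole sets, here lower case to upper case.
shouted=$(printf 'very happy\n' | tr '[:lower:]' '[:upper:]')

# -d deletes characters instead of replacing them.
no_spaces=$(printf 'Very Happy\n' | tr -d ' ')

printf '%s\n' "$as_csv" "$shouted" "$no_spaces"
rm -rf "$workdir"
```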

Save to a new file or append to an existing file

> and >> operators

Say you want to save the output of your operations to a file. Use ‘>’ to write to a new file (an existing file with that name is overwritten) or ‘>>’ to append to an existing file.

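The difference between the two operators is easiest to see by counting lines after each step. A minimal sketch, with hypothetical file names sample.txt and filtered.txt:

```shell
# Hypothetical four-row sample and an output file for the filtered rows.
workdir=$(mktemp -d)
sample="$workdir/sample.txt"
out="$workdir/filtered.txt"
printf '%s\n' '1,Very Happy,90' '2,Not Happy,35' '3,Happy,70' '4,Very Happy,85' > "$sample"

grep 'Very Happy' "$sample" > "$out"    # creates filtered.txt with 2 lines
grep 'Not Happy' "$sample" >> "$out"    # appends 1 line, 3 in total
after_append=$(wc -l < "$out" | tr -d ' ')

grep 'Not Happy' "$sample" > "$out"     # overwrites: only 1 line remains
after_overwrite=$(wc -l < "$out" | tr -d ' ')

echo "$after_append $after_overwrite"   # prints: 3 1
rm -rf "$workdir"
```

Because ‘>’ silently replaces an existing file, it is worth double-checking the target name before running a long pipeline.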
I will update this list as and when I see a command deserving enough to be in a data scientist’s toolbox.

Did you find the article useful? If you did, share your thoughts on the topic in the comments.


Manish Barnwal
