I am no expert in shell commands, but I have been using them for quite some time, so I thought I would attempt to list the most common ones. I am writing mostly from the perspective of a data scientist. Let us get started.
I will use the file ‘data.txt’ to illustrate these commands. ‘data.txt’ has 200 rows and 8 columns. You can access the data here.
cat Prints the entire contents of the file to your terminal
cat data.txt
We usually don’t want to bombard our terminal with the complete contents of a file. If you want to look through the whole file, open it in the vim editor instead.
vim Opens the file in the editor
vim data.txt
head Gives the top 10 rows of the text file at your terminal
head data.txt
tail Gives the bottom 10 rows of the text file at your terminal
tail -n 2 data.txt -- This will give you the bottom 2 rows of the file
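Both head and tail accept -n to change how many lines they print, and tail also accepts a leading + to mean "start from this line". Here is a small sketch on a made-up 12-line file (sample.txt is a stand-in created just for this example, not the data.txt used in this post):

```shell
# Build a small made-up file so the example runs on its own
workdir=$(mktemp -d)
printf 'row%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > "$workdir/sample.txt"

first_three=$(head -n 3 "$workdir/sample.txt")    # first 3 lines
last_two=$(tail -n 2 "$workdir/sample.txt")       # last 2 lines
from_line_11=$(tail -n +11 "$workdir/sample.txt") # line 11 to the end

echo "$first_three"
echo "$last_two"
echo "$from_line_11"

rm -r "$workdir"
```

The tail -n +2 form is handy for skipping a header row before passing a file to other commands.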
The piping operator
cat data.txt | head
Notice the | operator. This is called the pipe operator. Piping lets you perform a sequence of operations in a single command.
So what exactly is piping?
A pipe is a facility of the shell that makes it easy to chain together multiple commands. When used between two Unix commands, it means that the output of the first command becomes the input of the second. Just read the | in the command as ‘pass the data on to’. More on this operator later in the post.
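To see a pipe doing real work, here is a small sketch that picks lines out of the middle of a file by chaining head and tail. The file sample.txt is made up for the example:

```shell
# Made-up 6-line file, created only for this example
workdir=$(mktemp -d)
printf 'line%s\n' 1 2 3 4 5 6 > "$workdir/sample.txt"

# head keeps the first 5 lines, then tail keeps the last 2 of those,
# so the pipeline prints lines 4 and 5
middle=$(head -n 5 "$workdir/sample.txt" | tail -n 2)
echo "$middle"

rm -r "$workdir"
```

Each command only sees what the previous one printed, which is why the order of the commands matters.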
wc
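wc is a fairly useful shell command that lets you count the number of lines (-l), words (-w) or bytes (-c) in a given file. Use -m to count characters instead; the two differ only when the file contains multibyte characters.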
wc -l data.txt -- gives you the number of lines in the file
wc -w data.txt -- gives you the number of words in the file
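wc -c data.txt -- gives you the number of bytes in the file (the same as the number of characters for plain ASCII text)
head -n 1 data.txt | wc -w -- gives you the number of columns in the file, provided the columns are separated by whitespace and no value contains a space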
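wc -w splits on whitespace, so for a comma-delimited file (which is what the sort and cut examples in this post assume) the word count of the first row is not the column count. One possible workaround, sketched on a made-up three-column file, is to turn each comma into a newline and count lines:

```shell
# Made-up comma-delimited sample with a header row
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
Name,Mood,Score
Asha,Very Happy,9
Ben,Not Happy,3
EOF

# The header has no spaces, so wc -w sees it as a single word
words=$(head -n 1 "$workdir/sample.csv" | wc -w | tr -d ' ')

# Replace each comma with a newline, then count lines to get the columns
# (tr -d ' ' strips the padding some wc implementations add)
cols=$(head -n 1 "$workdir/sample.csv" | tr ',' '\n' | wc -l | tr -d ' ')

echo "words=$words columns=$cols"   # prints: words=1 columns=3

rm -r "$workdir"
```

This simple approach assumes no value in the first row contains a comma of its own, for example inside quotes.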
grep
Think of ‘grep’ as a command for filtering lines. You may want to print all the lines in your file that contain a particular phrase. Say, for example, you want to see people who are ‘Very Happy’. You simply pass the phrase to the grep command.
grep 'Very Happy' data.txt | head
-- gives you the top 10 rows having 'Very Happy'
Let us say, we want to count the number of users who are ‘Not Happy’
grep 'Not Happy' data.txt | wc -l
-- gives you the number of rows having 'Not Happy'
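grep can do the counting itself with -c, and -v inverts the match. It also matches substrings, which is easy to trip over: the pattern 'Happy' matches ‘Not Happy’ rows too. A small sketch on made-up data (the names and values below are invented for illustration):

```shell
# Made-up sample, created only for this example
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
Name,Mood,Score
Asha,Very Happy,9
Ben,Not Happy,3
Chen,Happy,7
Dana,Not Happy,2
EOF

not_happy=$(grep -c 'Not Happy' "$workdir/sample.csv")  # count matching lines
any_happy=$(grep -c 'Happy' "$workdir/sample.csv")      # substring: matches all 4 data rows
only_happy=$(grep -c ',Happy,' "$workdir/sample.csv")   # commas pin down the whole field
the_rest=$(grep -v 'Not Happy' "$workdir/sample.csv")   # lines that do NOT match

echo "$not_happy $any_happy $only_happy"   # prints: 2 4 1
echo "$the_rest"

rm -r "$workdir"
```

Including the delimiters in the pattern, as in ',Happy,', is a simple way to match a whole field when the field is not the first or last column.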
sort
Use sort to order the data by some column, say ‘Score’, which is the 3rd column in data.txt.
sort -t ',' -k 3 -n -r data.txt | head -n 5
-- gives you the 5 rows with the highest 'Score'
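Explanation: -t is used to specify the delimiter; ‘,’ in this case.
If the columns are separated by whitespace (spaces or tabs), we don’t need to specify the -t argument, because sort splits fields on blanks by default. If a tab-delimited file has values that contain spaces, the tab must still be passed to -t explicitly.
-k is used to specify the column on which you want to sort the data; 3 in this case. Strictly speaking, -k 3 means from the 3rd field to the end of the line; -k 3,3 limits the key to the 3rd column alone.
-n specifies that sorting is to be done numerically
-r implies that the sorting is descending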
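If the file has a header row, it gets sorted along with the data. A common workaround is to skip it with tail -n +2 first. The sketch below uses a made-up file (invented names and scores) and also shows why -n matters: without it, ‘10’ sorts before ‘9’ because the comparison is done as text.

```shell
# Made-up sample with a header row, created only for this example
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
Name,Mood,Score
Asha,Very Happy,9
Ben,Not Happy,3
Chen,Happy,10
Dana,Not Happy,2
EOF

# Skip the header, sort numerically on column 3 only, highest first
top_two=$(tail -n +2 "$workdir/sample.csv" | sort -t ',' -k 3,3 -n -r | head -n 2)

# Same sort without -n: text comparison puts 9 ahead of 10
text_top=$(tail -n +2 "$workdir/sample.csv" | sort -t ',' -k 3,3 -r | head -n 1)

echo "$top_two"
echo "$text_top"

rm -r "$workdir"
```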
cut
This command extracts specific columns. Say you want to see only the 4th column of the file.
cut -d ',' -f 4 data.txt | head
Explanation: ‘,’ is the delimiter. 4 is the column number that you want to see.
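The -f argument also accepts a comma-separated list of columns or a range. A short sketch on a made-up three-column file (the column names are invented for the example):

```shell
# Made-up sample, created only for this example
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
Name,Mood,Score
Asha,Very Happy,9
Ben,Not Happy,3
EOF

name_score=$(cut -d ',' -f 1,3 "$workdir/sample.csv")                 # columns 1 and 3
mood_onward=$(head -n 1 "$workdir/sample.csv" | cut -d ',' -f 2-3)    # columns 2 through 3

echo "$name_score"
echo "$mood_onward"

rm -r "$workdir"
```

cut keeps the input delimiter in its output, so the selected columns are still comma-separated.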
uniq
Do not mistake this command for a general ‘unique’. It is slightly different: it removes only adjacent duplicate lines. So if you want the unique values of a column, you need to sort the data first and then pipe it to uniq.
To get the unique values of a column, say the 2nd column
cut -d ',' -f 2 data.txt | sort | uniq
This command can be used with the -c argument to count the occurrences of each distinct value, similar to COUNT(*) with GROUP BY in SQL.
cut -d ',' -f 2 data.txt | sort | uniq -c
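The sketch below, on made-up data, shows what happens when the sort is left out, and adds one more sort at the end to rank the values by frequency. The names and moods are invented for the example:

```shell
# Made-up sample with a header row, created only for this example
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
Name,Mood
Asha,Very Happy
Ben,Not Happy
Chen,Happy
Dana,Not Happy
Eli,Not Happy
Fay,Very Happy
EOF

# Column 2 without the header
moods=$(tail -n +2 "$workdir/sample.csv" | cut -d ',' -f 2)

# Without sort, only the two adjacent 'Not Happy' lines collapse: 5 lines remain
unsorted_count=$(echo "$moods" | uniq | wc -l | tr -d ' ')
# With sort, every duplicate is adjacent: 3 distinct values remain
sorted_count=$(echo "$moods" | sort | uniq | wc -l | tr -d ' ')

# Count each value, rank by count, and strip the padding uniq -c adds
most_common=$(echo "$moods" | sort | uniq -c | sort -n -r | head -n 1 | sed 's/^ *//')

echo "$unsorted_count $sorted_count"   # prints: 5 3
echo "$most_common"                    # prints: 3 Not Happy

rm -r "$workdir"
```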
tr
tr stands for translate. It is the UNIX counterpart of the ‘Find and Replace’ function in Excel, except that it works on single characters rather than whole words. A typical use for me, on a regular basis, is converting files between delimiters: the files I get from Hive are tab-delimited, and I often want them ‘,’ delimited.
You may also want to replace certain characters in a file with something else using the tr command.
cat data.txt | tr ',' '\t' -- changes the ',' delimiter to '\t'; swap the two arguments to convert tabs to commas
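A few more things tr can do, sketched on made-up strings: converting tabs to commas (the Hive case), changing case with character classes, and deleting characters with -d. Keep in mind that tr maps character to character, so it cannot replace one word with another.

```shell
# Tab-delimited line (made up) converted to comma-delimited
comma_line=$(printf 'Asha\tVery Happy\t9\n' | tr '\t' ',')

# Character classes avoid locale surprises with ranges like a-z
upper=$(printf 'very happy\n' | tr '[:lower:]' '[:upper:]')

# -d deletes every character in the set instead of replacing it
no_digits=$(printf 'id-42\n' | tr -d '[:digit:]')

echo "$comma_line"   # prints: Asha,Very Happy,9
echo "$upper"        # prints: VERY HAPPY
echo "$no_digits"    # prints: id-
```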
Save to a new file or append to an existing file
> and >> operators
Say you want to save the output of your operations to a file. Use ‘>’ to write to a new file (if the file already exists, it is overwritten) and ‘>>’ to append to the end of an existing file.
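A minimal sketch of the difference, using made-up file names (in.txt and out.txt are created in a temporary directory just for this example):

```shell
workdir=$(mktemp -d)
printf 'a\nb\nc\n' > "$workdir/in.txt"

head -n 1 "$workdir/in.txt" > "$workdir/out.txt"    # creates out.txt with 'a'
tail -n 1 "$workdir/in.txt" >> "$workdir/out.txt"   # appends 'c' to it
after_append=$(cat "$workdir/out.txt")

head -n 1 "$workdir/in.txt" > "$workdir/out.txt"    # '>' again: old contents are replaced
after_overwrite=$(cat "$workdir/out.txt")

echo "$after_append"
echo "$after_overwrite"

rm -r "$workdir"
```

Because ‘>’ silently replaces an existing file, it is worth double-checking the target name before running a long pipeline.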
I will update this list as and when I see a command deserving enough to be in a data scientist’s toolbox.
Did you find the article useful? If you did, share your thoughts on the topic in the comments.