After MapReduce, MongoDB, and NoSQL got all the attention in 2013, I feel it's time to take a look at some more basic but often forgotten command-line tools that help you process BigData in a powerful way.
Here is our list (which we collectively call #HypeReduce):
#1) logrotate (8) – this nifty Unix command is usually called from a cron job. logrotate rotates, compresses, and mails system logs, and therefore ensures BigData doesn't become a problem in the first place. Any time one of your team members feels like it's a good time to blabber about BigData benefits, you should be able to silence them quickly by mentioning this command.
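A minimal sketch of what such a rotation policy looks like — the path and app name are made up for illustration, and the directives are standard logrotate ones:

```
# Hypothetical /etc/logrotate.d/bigdata
/var/log/bigdata/app.log {
    weekly          # rotate once a week
    rotate 4        # keep four old logs, then discard
    compress        # gzip rotated logs
    missingok       # no error if the log is absent
    notifempty      # skip rotation when the log is empty
}
```

Drop a file like this into /etc/logrotate.d/ and the daily cron job does the rest.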
#2) find (1) – a powerful command that helps you get back data you lost track of or never knew existed. Especially in combination with xargs (1), it lets you clean up for good. Don't be intimidated by its many options. Zen-like knowledge of this tool doesn't come overnight and is often passed down by older generations of sysadmins and BigData gurus.
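A minimal sketch of the find/xargs cleanup idiom, sandboxed in a throwaway directory so it's safe to try anywhere:

```shell
# Sandbox so the demo can't touch anything real
demo=$(mktemp -d)
touch "$demo/a.tmp" "$demo/b.tmp" "$demo/keep.log"

# -print0 / xargs -0 survive spaces in filenames;
# in real life add something like -mtime +30 to target stale files only
find "$demo" -type f -name '*.tmp' -print0 | xargs -0 rm -f

ls "$demo"   # only keep.log remains
```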
#3) du (1) – estimate file space usage. This little command doesn't look like much, at least at first sight … but don't be fooled, because this gem helps you determine whether files are getting too big long before anyone can shout MapReduce!
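A quick, self-contained sketch of the classic "rank things by size" pipeline (note: `sort -h` is available in GNU coreutils and modern BSDs, but not in every minimal environment):

```shell
# Fabricate two files of different sizes in a sandbox
demo=$(mktemp -d)
head -c 1024  /dev/zero > "$demo/small.dat"
head -c 65536 /dev/zero > "$demo/big.dat"

# Human-readable sizes, smallest first; the biggest offender lands last
du -h "$demo"/* | sort -h
```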
#4) lsof (8) – list open files. This is a more exotic tool, and some of you may wonder what it has to do with BigData. Its power stems from letting you keep track of any process writing excessively to files. If some command is too chatty for its own good, you can use kill (1) to make it STFU.
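A small sketch of catching a process red-handed; the `|| true` hedges against systems where lsof isn't installed:

```shell
# Open a file in the background, then ask lsof who is holding it
demo=$(mktemp)
tail -f "$demo" &          # a background process keeping the file open
pid=$!

lsof "$demo" || true       # lists the tail process (if lsof is installed)

kill "$pid"                # too chatty for its own good? STFU.
```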
#5) rm (1) – remove files or directories. This command has been almost entirely forgotten and is deprecated on most modern data-processing systems, so it is rarely mentioned in BigData discussions. If it is still installed on your OS, we recommend you get familiar with it. If it is unavailable on your machine, you can probably write your own implementation in your favourite programming language, which may still support the unlink(2) system call.
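For those lucky enough to still have it installed, the forgotten classic in action (sandboxed, throwaway paths only):

```shell
demo=$(mktemp -d)
touch "$demo/bigdata.log"
mkdir "$demo/cluster"

rm "$demo/bigdata.log"     # remove a single file
rm -r "$demo/cluster"      # -r descends into directories
rmdir "$demo"              # the sandbox itself is now empty
```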
#6) xz, bzip2, gzip (1) – compress files. Compression allows you to keep in check anyone who claims their data is bigger than yours. Although not as powerful as rm (1), it should buy you plenty of time to come up with a more cunning strategy. Data is like sex: size doesn't matter, but quality does! So don't listen to anyone with an agenda trying to make you insecure about the size of your Data; just turn BigData into SmallData using compression!
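A before-and-after sketch using gzip (the same pattern works with xz or bzip2); BigData, it turns out, is often highly redundant:

```shell
f=$(mktemp)
yes 'BigData' | head -n 100000 > "$f"   # ~800 KB of very repetitive "data"
before=$(wc -c < "$f")

gzip -9 "$f"               # strongest compression; replaces $f with $f.gz
after=$(wc -c < "$f.gz")

echo "before: $before bytes, after: $after bytes"
rm -f "$f.gz"
```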
#7) /dev/null – this is not a command per se, but its existence on modern Unix variants prevents useless data from spreading like cancer and immediately sends it where it belongs: the bitbucket!
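The bitbucket in action: it swallows anything written to it and always reads back empty.

```shell
seq 1 1000000 > /dev/null   # a million lines, gone instantly
cat /dev/null | wc -c       # reads back zero bytes, every time
```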