Linux Software I Use for Data Science Via the Terminal

In the previous post, I talked about some of the software I use every day for data science tasks. The vast majority of the time, I use this software for social network analysis (my area of expertise), as well as other generic data science tasks (such as scraping and data manipulation).

Towards the end of the post, I mentioned how I use the built-in macOS terminal application to remotely access Linux servers via SSH to run commands for processing large datasets. It’s worth remembering that all of this is done without a desktop environment – using just the command prompt.

In this blog post, I cover some of the most useful and important Linux software I use regularly. This includes tools I use for analysing datasets, as well as other general, but useful, software.

tmux

tmux is a terminal multiplexer, which is a fancy name for a tool that runs multiple terminal sessions at once. tmux allows you to switch between different sessions and keep them running in the background, even if you’re not logged into the server.

I often use tmux to open up a new session to run different commands (such as executing Python scripts or formatting large datasets). The best part of tmux is that I can leave a program running in the background whilst I’m disconnected from the server, meaning that I can get on with other tasks. It also has a split-screen feature, meaning that I can have multiple terminals open in one session. Awesome!
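As a rough illustration of that workflow, here is a minimal Python sketch that starts a long-running command inside a detached tmux session. The session name "cleaning" and the script name clean_data.py are made up for the example, and it assumes tmux is installed on the server.

```python
import subprocess


def tmux_new_session_command(session_name, shell_command):
    """Build the tmux command that starts a detached, named session."""
    # -d starts the session detached (in the background);
    # -s gives it a name so it can be reattached later.
    return ["tmux", "new-session", "-d", "-s", session_name, shell_command]


def start_detached(session_name, shell_command):
    """Run the command in a new detached tmux session (requires tmux)."""
    subprocess.run(tmux_new_session_command(session_name, shell_command), check=True)


# Hypothetical usage:
# start_detached("cleaning", "python clean_data.py")
```

After logging back into the server, running "tmux attach -t cleaning" in the terminal should bring the session back, provided the command is still running.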

IPython

A large part of my job as a data scientist involves putting together Python scripts. If I need to automate a task, or scrape some data, I put together a quick Python script to help speed things up.

But what if you only need to run a few lines of code without needing a full text editor? IPython (Interactive Python) allows you to execute Python code interactively within the terminal. This is really handy if I want to quickly examine or test something without putting together a Python script. I believe that some Linux distros already have IPython pre-installed.
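To give a flavour of the kind of quick check I mean, here are a few lines that could be typed straight into an IPython prompt. The edge list is made up purely for illustration; in practice it would come from a real dataset.

```python
from collections import Counter

# A tiny, made-up edge list standing in for real network data.
edges = [("alice", "bob"), ("bob", "carol"), ("alice", "carol"), ("carol", "dave")]

# Count how many edges touch each node (its degree).
degree = Counter(node for edge in edges for node in edge)

# The single best-connected node and its degree.
top = degree.most_common(1)
print(top)
```

Nothing here needs to be saved to a file. Each line runs as soon as it is entered, which is what makes this style of testing so quick.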

ImageMagick

ImageMagick is a collection of software for performing many complex image manipulation and editing tasks on a wide range of image formats. It can be used for a variety of different applications, including graphic design, digital art and data visualisation. In other words, it’s essentially a very basic version of Photoshop for the command line.

I don’t use ImageMagick an awful lot, but when I do, it’s an incredibly valuable tool for performing very basic tasks, such as making a background transparent. I mainly use it to create animated data visualisations in the form of an animated GIF, by stitching together a collection of frames exported from matplotlib.
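The frame-stitching step can be sketched in Python roughly as follows. This assumes the frames have already been exported as PNG files with sortable names (frame_000.png, frame_001.png and so on, names chosen for the example) and that ImageMagick is installed. The helper names are hypothetical.

```python
import shutil
import subprocess


def build_gif_command(frames, output="animation.gif", delay=20):
    """Build an ImageMagick command that stitches frames into a looping GIF."""
    # Use `magick` (ImageMagick 7) when it is available, else the older `convert`.
    binary = "magick" if shutil.which("magick") else "convert"
    # -delay is measured in hundredths of a second; -loop 0 repeats forever.
    # Sorting keeps the frames in filename order.
    return [binary, "-delay", str(delay), "-loop", "0", *sorted(frames), output]


def make_gif(frames, output="animation.gif", delay=20):
    """Run the command (requires ImageMagick to be installed)."""
    subprocess.run(build_gif_command(frames, output, delay), check=True)


# Hypothetical usage:
# make_gif(["frame_000.png", "frame_001.png", "frame_002.png"])
```

Zero-padding the frame numbers matters here, since the frames are ordered by filename.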

Graphviz

Graphviz is a graph visualisation tool for building network-based diagrams using the DOT language. It is widely used for networking, bioinformatics, software engineering and more. It also has a collection of neat layout engines. Much like ImageMagick, it can be treated as another data visualisation tool but for graphs and networks instead.

I find Graphviz useful for building and previewing social networks when I’m accessing a headless server. There are a few Python packages (networkx being one of them) which can access Graphviz from a script, meaning that you can export networks (via Graphviz) programmatically. I can then view the results on my main desktop machine.
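As a minimal sketch of exporting a network programmatically, the example below writes the DOT text by hand and passes it to the Graphviz command-line tool, rather than going through networkx, to keep the dependencies down. The function names and the example edges are hypothetical, and it assumes node names contain no double quotes.

```python
import subprocess


def edges_to_dot(edges, name="G"):
    """Turn a list of (source, target) pairs into an undirected DOT graph."""
    lines = [f"graph {name} {{"]
    for source, target in edges:
        # "--" is the DOT edge operator for undirected graphs.
        lines.append(f'    "{source}" -- "{target}";')
    lines.append("}")
    return "\n".join(lines)


def render_png(dot_text, output_path, engine="dot"):
    """Render DOT text to a PNG file (requires Graphviz to be installed)."""
    # The layout engine reads the graph from standard input
    # when no input file is given.
    subprocess.run(
        [engine, "-Tpng", "-o", output_path],
        input=dot_text,
        text=True,
        check=True,
    )


# Hypothetical usage:
# render_png(edges_to_dot([("alice", "bob"), ("bob", "carol")]), "network.png")
```

Swapping the engine argument for another Graphviz layout engine, such as neato or sfdp, changes how the nodes are arranged without touching the DOT text.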

Other honourable mentions

Alongside the main software mentioned in this blog post, here are a few honourable mentions that are just as important, even though they’re not strictly data science related.

  • Git: Useful for sharing stuff with others. I also use Git as a means to back my code up to a private repository.
  • SCP: Used to share files between machines. I use SCP to transfer data that I’ve collected from a server to my local machine, so it can be analysed.
  • nano / Vim: Useful when I need to edit a file quickly without a full GUI text editor (such as VS Code).

Conclusions

In this blog post, I outlined some of the most important software I use regularly, using nothing but the Linux terminal. It is by no means a comprehensive list, as I’m sure there is more software I use without being aware of it. What terminal-based software do you use the most?