Top 5 Pieces of Software I Use for Data Science Projects Every Day

Like most data scientists and researchers, I spend most of my day in front of a computer. Most of the time, I’m either writing, reading or coding something exciting up.

Over the past couple of years as a researcher / academic / data scientist, I’ve come to use a lot of software. Some fancy, others not so. Some free, others expensive. Of all the software I use, pretty much all of it is free and open source, meaning that it’s available to everyone.

In this blog post, I cover the top five pieces of software that I couldn’t do without and have come to use on an almost daily basis. If you’re new to data science, hopefully this will be of help to you.

In no particular order, here are the five pieces of software I find most useful for data science projects.

VS Code

VS Code (known formally as Visual Studio Code) is a highly extensible text editor produced by Microsoft. It can do pretty much anything the fancy, overkill IDEs can do, all in one application. There is a whole marketplace dedicated to third-party extensions, where people can contribute their own features, making it easier for you to build your own custom development environment with very little overhead.

I personally mostly use it for building and testing Python scripts, but I have also used it for other projects in the past like writing applications in the Go programming language and for web development. One of my favourite features is the ability to access a server via SSH to remotely edit and execute code.

In my opinion, VS Code is essential software for anyone working with code.

Jupyter Notebook

Jupyter Notebook is a tool for building interactive Python code that runs within a web browser (and also in VS Code with the official extension). Jupyter Notebooks are built from “cells”, which are used to store blocks of Python code that can be executed, or blocks of Markdown for annotating and documenting your analysis.
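To make the idea of cells a bit more concrete, here is a minimal sketch of what a notebook looks like on disk. It assumes version 4 of the notebook file format, which is a JSON document. The cell contents and the helper names `cell_types` and `save_notebook` are made up for illustration.

```python
import json

# A notebook is a JSON document holding a list of cells.
# Assumption: notebook format version 4; the cell contents are illustrative.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            # A Markdown cell for annotating the analysis.
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# My analysis\n", "Notes about what the next cell does."],
        },
        {
            # A code cell holding executable Python.
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["total = sum(range(10))\n", "print(total)"],
        },
    ],
}


def cell_types(nb):
    """Return the type of each cell, in order."""
    return [cell["cell_type"] for cell in nb["cells"]]


def save_notebook(nb, path):
    """Write the notebook dictionary to a .ipynb file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(nb, f, indent=1)
```

In practice you would rarely write this by hand, as Jupyter creates and updates the file for you, but it shows that code and Markdown simply sit side by side as cells in the same document.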

I would personally consider Jupyter Notebook “must have software” for any data scientist, as it’s designed to help you document your work and present it to others in a meaningful way, much like a physical notebook. The Markdown feature helps you build a dialogue around your findings. This is particularly beneficial if you’re writing research papers or presenting your findings to others.

Anaconda

Anaconda is a distribution of Python specialised for scientific computing. This includes tasks like data science and machine learning. It also features a virtual environment manager for keeping different types of projects separate. Anaconda also comes packed with loads of Python packages to help you build the right environment for your type of analysis.

Although it’s not what you think of when it comes to “conventional” software, many people (including myself) use it every day without realising it, as it just sits in the background and goes unnoticed. I use Anaconda for creating virtual environments that contain the Python packages I need for performing my network analysis. In my case, this includes things like pandas, networkx, numpy and matplotlib.
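As a rough sketch, here is one way to check that the environment you have activated actually contains the packages you expect. It only uses the Python standard library. The function name `missing_packages` and the `REQUIRED` list are my own for illustration, so swap in whatever your analysis needs.

```python
from importlib.util import find_spec

# Packages mentioned above; adjust this list for your own projects.
REQUIRED = ["pandas", "networkx", "numpy", "matplotlib"]


def missing_packages(names):
    """Return the names that cannot be imported in the active environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing from this environment:", ", ".join(missing))
    else:
        print("All required packages are available.")
```

Running this inside each environment is a quick sanity check before starting a long analysis.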

Gephi

Gephi is an open source piece of software designed for network analysis and visualisations. It is well established among the academic research community for building some stunning network visualisations using large data sets. It is also fairly versatile and allows you to perform custom analysis.

If you’ve visited this blog before, you’ll know by now that I work a lot with networks, as most of my academic interests are in social network analysis. Gephi is an amazing piece of software for building, analysing and visualising networks of all shapes and sizes. If you’re interested in social network analysis, Gephi is your best companion.

Any terminal

This one may not sound too exciting, but again, it’s another one of those things I use on a daily basis without really realising it. I mainly use the terminal to remotely access a Linux server by SSH so that I can execute commands for processing large data sets on a remote machine. This is handy if I want to let something run overnight without my laptop open all the time.
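For anyone who would rather script this than type it out each time, here is a minimal sketch of one common way to start a long-running job over SSH so that it keeps going after you disconnect. It assumes the remote machine has `nohup` available and that you have SSH access set up. The function names, host and file names are hypothetical, and this is not necessarily how I run my own jobs.

```python
import shlex
import subprocess


def build_ssh_command(host, remote_command, log_file):
    """Build an ssh argument list that starts remote_command in the background."""
    # nohup keeps the job alive after the SSH session closes;
    # output is redirected to log_file so it can be checked later.
    remote = "nohup {} > {} 2>&1 < /dev/null &".format(
        remote_command, shlex.quote(log_file)
    )
    return ["ssh", host, remote]


def launch(host, remote_command, log_file):
    """Run the ssh command; needs a reachable host and working SSH login."""
    return subprocess.run(
        build_ssh_command(host, remote_command, log_file), check=True
    )
```

For example, `launch("user@example.com", "python process.py", "run.log")` would start a hypothetical `process.py` script on the server and return straight away, leaving the laptop free to close.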

In my case, I use the built-in macOS Terminal app, but there are many terminal applications out there. I believe Windows has its own with Bash included too! Alternatively, anything with SSH connectivity enabled will do.

Conclusions

In this blog post, I discussed five pieces of software I use on a daily basis for analysing data. Bear in mind that I am mostly analysing network structures, so my requirements will differ from those of your average data scientist. This is by no means an exhaustive list, and there may be a few things that I have completely overlooked.

Think I’ve missed anything important? Let me know as I’m always up for discovering new software to use. What software do you use?