Finding the ideal dataset is not always easy. One option is to put together your own dataset, using techniques such as web scraping and collecting data through APIs, to suit your needs and those of others.
This might be necessary depending on your circumstances; however, it does come at a cost. Collecting your own data can be challenging, as the dataset grows over time and requires ongoing maintenance to preserve its quality.
Furthermore, there are other constraints to take into account, such as hosting (where and how to store the data), ethical considerations (e.g. handling of personal information) and accessibility (sharing your data with others).
Building your own dataset is therefore not always the best or most practical option, as it can become quite time-consuming.
If this does not appeal to you, thankfully there is an alternative. There are many online platforms for finding publicly available datasets maintained by others.
In this post, we focus on five of the most popular platforms for discovering and publishing datasets.
Whether you wish to put together a helpful data visualisation or to build a complex machine learning model, there is a strong possibility that the dataset you’re after has already been created and published by someone else.
Google is perhaps best known for its search engine, which lets you search the entire web with a few keywords. What few people know is that Google also has a service, Dataset Search, for searching specifically for datasets.
This is super handy if you don’t know exactly what you’re searching for. All you need to do is type a few keywords describing the dataset you’re interested in. For example, searching “fake news tweets” returned a total of 90 results at the time of writing!
This is particularly useful for those from a research or academic background, as you’ll be able to find the original research papers and journals associated with the data.
Kaggle is an online platform and community of data scientists and researchers where users can publish and explore datasets. It is perhaps best known for hosting competitions in which individuals and teams compete for prizes by solving complex data science challenges.
All of the datasets used in these challenges can be found in the datasets section of the website. Feel free to enter one of these challenges if you’re interested in learning more about data science or just wish to test your skills. There is also the possibility of winning a cash prize if you get good results!
If you thought that GitHub was just for hosting and collaborating on open-source software, think again! The “Awesome Public Datasets” repository is a document (or “wiki”) where users can contribute links to public datasets.
While it has no convenient search feature, this repo serves as a directory that points you in the right direction and lets others share their own datasets.
I personally use this repository for finding interesting datasets when I don’t know exactly what I’m looking for.
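Since the repository has no built-in search, one workaround is to save a copy of its README and filter it yourself. The snippet below is only a rough sketch: it assumes the entries are written as Markdown links of the form [title](url), and the sample text, titles and example.com URLs in it are made up for illustration rather than taken from the real list.

```python
import re

# Matches Markdown links such as [Title](https://example.com/data).
LINK_PATTERN = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")


def search_links(markdown_text, keyword):
    """Return (title, url) pairs from every line that mentions the keyword."""
    keyword = keyword.lower()
    matches = []
    for line in markdown_text.splitlines():
        # Case-insensitive match against the whole line, so a hit in the
        # title, the URL or the short description all count.
        if keyword in line.lower():
            matches.extend(LINK_PATTERN.findall(line))
    return matches


# Hypothetical sample in the assumed format (not real entries).
SAMPLE_README = """
## Climate
- [Example Weather Archive](https://example.com/weather) - Daily temperature readings.
## Sports
- [Example Football Results](https://example.com/football) - Match scores.
"""

for title, url in search_links(SAMPLE_README, "temperature"):
    print(f"{title}: {url}")
```

To use it on the real list, you would read your saved copy of the README into a string and pass that in place of SAMPLE_README. Note that the keyword is only matched against the line containing the link, so section headings are not taken into account.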
You’ve probably heard of Wikipedia, but you may not have heard of Wikidata. Wikidata is a free and open knowledge base for retrieving facts and figures about anything and everything. It is described as a knowledge graph, a data structure for representing linked data.
Data from Wikidata can be retrieved using its SPARQL query service. For example, the screenshot below shows how to find famous people born in New York.
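For readers who prefer code, here is a minimal sketch of the same kind of query sent from Python using only the standard library. It is not necessarily the query used in this post: it assumes that “New York” means New York City (item Q60), it simply lists people (instances of human, Q5) whose place of birth (P19) is that item, and it makes no attempt to measure how famous they are. The function names and the User-Agent string are placeholders of my own.

```python
import json
import urllib.parse
import urllib.request

# Public endpoint of the Wikidata Query Service.
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(place_qid="Q60", limit=10):
    """Build a SPARQL query for humans born in the given place.

    Q60 (New York City) is an assumed default for this example.
    """
    return f"""
SELECT ?person ?personLabel WHERE {{
  ?person wdt:P31 wd:Q5 ;
          wdt:P19 wd:{place_qid} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {limit}
""".strip()


def build_url(query):
    """Encode the query as a GET request URL that asks for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return f"{WIKIDATA_SPARQL_ENDPOINT}?{params}"


def fetch_labels(query):
    """Send the query and return the English labels of the matching people."""
    request = urllib.request.Request(
        build_url(query),
        # Placeholder value: replace with something that identifies your script.
        headers={"User-Agent": "dataset-post-example/0.1"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return [row["personLabel"]["value"] for row in payload["results"]["bindings"]]
```

Calling fetch_labels(build_query()) sends the request and returns a list of names. Nothing is sent over the network until you make that call, so you can print build_query() first and paste the result into the query service’s web editor to check it.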
You may not have thought of Reddit as a place to find datasets, but the r/datasets subreddit has an active community of users sharing and requesting datasets. If you can’t find what you’re after using the options above, this community could be the place for you. Its members may be able to point you in the right direction when all else fails.
To summarise, this post has covered five of the most popular platforms for finding datasets as an alternative to collecting your own. I’m sure there are many other platforms that I have failed to mention, but these are the ones I have used before.
There is nothing wrong with collecting and building your own dataset, but it’s always worth checking whether the data you need has already been published by someone else. This can save you loads of time and energy, as someone else has already done the hard work for you.