Loading Data into Pandas: 5 Tips and Tricks You May or May Not Know
If you’ve ever worked with Pandas in Python before, you’ll know that there are a bunch of different ways in which you can import tabulated data.
Pandas is perhaps one of the most widely used Python packages for loading, manipulating and exporting data. We’ve even explored this in a separate blog post.
While you may be familiar with the many different ways to read and write data using pandas, you may not be aware of a few shortcuts/hacks which may not be as obvious as you may think. This can make importing data much easier and faster.
This blog post explores just five of these techniques
1. Import CSV from URL
You’re probably very familiar with the pd.read_csv()
method for importing Comma Separated Value files. But you may not know that it can be used to read data from a URL over HTTP.
For example, reading in a CSV file from a URL acts the same as a file stored locally.
>>> url = 'https://raw.githubusercontent.com/scikit-learn/scikit-learn/main/sklearn/datasets/data/boston_house_prices.csv'
>>> df = pd.read_csv(url)
Rather than storing all your files locally, you can access CSV files from a remote location. This saves you the hassle of having to download things manually.
2. Import HTML Table from a Website
If you’ve ever read the pandas documentation, you probably would’ve come across the pd.read_html()
method. Much like, pd.read_csv()
this too also has the additional feature of exporting data from a URL. What makes this even cooler is that you can provide a URL to a website with many tables and it will scrape them all.
For example, consider the Wikipedia article “Economy of the United Kingdom”. This article can be scraped as follows.
>>> url = 'https://en.wikipedia.org/wiki/Economy_of_the_United_Kingdom'
>>> df_tables = pd.read_html(url)
This article has a total of 33 tables.
>>> len(df_tables)
33
3. JSON Normalised
Sometimes when you are working with JSON data, you’ll find that it’s quite common for a JSON object to feature many nested objects. For a JSON object to be converted into tabular data, they need to be flatted (has a nested depth of one).
Conveniently, pandas provide a way to do this using the pd.json_normalize()
method.
For example, consider the following JSON object (as used as an example in the pandas documentation)
>>> data = [
{
"id": 1,
"name": "Cole Volk",
"fitness": {"height": 130, "weight": 60},
},
{"name": "Mark Reg", "fitness": {"height": 130, "weight": 60}},
{
"id": 2,
"name": "Faye Raker",
"fitness": {"height": 130, "weight": 60},
},
]
It features attributes which contain nested objects (“fitness” in this case). Using the pd.json_normalize()
, this can be collapsed into table. Note: an optional parameter of max_level
can be added to specify the maximum number of nested levels to collapse. By default, it will normalise all levels.
>>> df = pd.json_normalize(data)
>>> df
id name fitness.height fitness.weight
0 1.0 Cole Volk 130 60
1 NaN Mark Reg 130 60
2 2.0 Faye Raker 130 60
4. Import from Clipboard
This is a rather simple one but let’s suppose you are copying and pasting data from a website into a Word document. The pd.read_clipboard()
method can be used to capture whatever data is stored on your clipboard.
By default, this method accepts a regular expression of \\s+
as a delimiter for separating values. Alternatively, you can use your own regular expression.
>>> df = pd.read_clipboard(sep='\\s+')
This is especially handy as it saves you from writing the data by hand meaning that you’re less prone to making mistakes if it was done manually.
5. Import from Excel
Let’s be honest, everyone has used Microsoft Excel (other spreadsheet software is available) at some point in their lives. If I’m honest, I still use it a lot and it remains one of the most popular ways of sending tabulated data to others.
Let’s suppose that someone has sent some data to you that is stored in an Excel spreadsheet, the pd.read_excel()
method can be used to read the data as if it were a simple CSV file.
One thing to bear in mind is that spreadsheet software allows users to segment multiple spreadsheets using sheets within a single file. Accessing data from a specific sheet can be achieved using the sheet_name
parameter.
>>> df = pd.read_excel('my_spreadsheet.xlsx', sheet_name='Sheet 1')
Conclusions
In summary, this blog post features some of the most useful panders shortcuts I use daily. Let me know if I’m missing any obvious ones as I’m sure there are many more than the ones I’ve listed here. Overall, panders is an incredibly useful Python package and it has so many different features for manipulating, reading and writing data.