Loading Data into Pandas: 5 Tips and Tricks You May or May Not Know
If you’ve ever worked with pandas in Python, you’ll know that there are many different ways to import tabulated data.
Pandas is perhaps one of the most widely used Python packages for loading, manipulating and exporting data. We’ve even explored this in a separate blog post.
While you may be familiar with the many ways to read and write data using pandas, there are a few less obvious shortcuts that can make importing data much easier and faster.
This blog post explores just five of these techniques.
1. Import CSV from URL
You’re probably very familiar with the pd.read_csv()
method for importing Comma Separated Value files. But you may not know that it can also read data from a URL over HTTP.
For example, reading a CSV file from a URL works exactly the same as reading a file stored locally.
>>> url = 'https://raw.githubusercontent.com/scikit-learn/scikit-learn/main/sklearn/datasets/data/boston_house_prices.csv'
>>> df = pd.read_csv(url)
>>> df
506 13 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12 Unnamed: 13
0 CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.09 1 296 15.3 396.9 4.98 24
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.9 9.14 21.6
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
.. ... .. ... ... ... ... ... ... ... ... ... ... ... ...
502 0.06263 0 11.93 0 0.573 6.593 69.1 2.4786 1 273 21 391.99 9.67 22.4
503 0.04527 0 11.93 0 0.573 6.12 76.7 2.2875 1 273 21 396.9 9.08 20.6
504 0.06076 0 11.93 0 0.573 6.976 91 2.1675 1 273 21 396.9 5.64 23.9
505 0.10959 0 11.93 0 0.573 6.794 89.3 2.3889 1 273 21 393.45 6.48 22
506 0.04741 0 11.93 0 0.573 6.03 80.8 2.505 1 273 21 396.9 7.88 11.9
Rather than storing all your files locally, you can access CSV files from a remote location. This saves you the hassle of having to download things manually.
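One wrinkle with the file above is that its first line holds metadata ("506 13") rather than column names, which is why pandas produces the "Unnamed" columns. Passing header=1 tells read_csv to use the second line as the header instead. Here's a minimal sketch of this, using io.StringIO to stand in for the remote file so it runs without a network connection:

```python
import io

import pandas as pd

# A stand-in for the remote CSV: the first line is metadata and the
# second line holds the real column names, mimicking the layout of
# the boston_house_prices.csv file above.
csv_text = "506,13,\nCRIM,ZN,INDUS\n0.00632,18,2.31\n0.02731,0,7.07\n"

# header=1 skips the metadata line and uses row 1 as the column names.
df = pd.read_csv(io.StringIO(csv_text), header=1)
print(df.columns.tolist())  # ['CRIM', 'ZN', 'INDUS']
print(df.shape)             # (2, 3)
```

The same header=1 argument works when passing the URL directly, since pd.read_csv treats URLs and file-like objects identically.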
2. Import HTML Table from a Website
If you’ve ever read the pandas documentation, you’ve probably come across the pd.read_html()
method. Much like pd.read_csv()
, it can also read data from a URL. What makes this even cooler is that you can point it at a web page containing many tables and it will scrape them all.
For example, consider the Wikipedia article “Economy of the United Kingdom”. This article can be scraped as follows.
>>> url = 'https://en.wikipedia.org/wiki/Economy_of_the_United_Kingdom'
>>> df_tables = pd.read_html(url)
This article contained a total of 33 tables at the time of writing.
>>> len(df_tables)
33
One of the tables contains historical GDP figures.
>>> df_tables[9]
Year GDP (in Bil. US$PPP) GDP per capita (in US$ PPP) GDP (in Bil. US$nominal) GDP per capita (in US$ nominal) GDP growth (real) Inflation rate (in Percent) Unemployment (in Percent) Government debt (in % of GDP)
0 1980 494.4 8777.0 604.0 10722.2 -2.1% 16.8% 7.1% 42.6%
1 1981 537.4 9536.1 587.7 10427.1 -0.7% 12.2% 9.7% 44.8%
2 1982 581.9 10337.3 558.7 9925.6 2.0% 8.5% 10.7% 43.1%
3 1983 630.1 11187.9 532.5 9455.1 4.2% 5.2% 11.5% 41.9%
4 1984 667.4 11832.2 504.6 8944.9 2.2% 4.4% 11.8% 42.2%
5 1985 716.9 12677.0 536.9 9494.3 4.1% 5.2% 11.4% 41.3%
6 1986 754.4 13309.1 655.1 11556.8 3.2% 3.6% 11.3% 41.3%
7 1987 814.8 14343.4 813.0 14312.5 5.4% 4.1% 10.4% 39.3%
8 1988 891.8 15669.5 989.6 17386.5 5.7% 4.6% 8.6% 37.1%
9 1989 950.7 16656.6 1007.4 17650.7 2.6% 5.2% 7.2% 32.5%
10 1990 993.5 17357.8 1193.7 20854.9 0.7% 7.0% 7.1% 28.6%
11 1991 1015.8 17684.8 1247.9 21725.3 -1.1% 7.5% 8.9% 28.5%
12 1992 1043.1 18114.3 1289.1 22385.4 0.4% 4.2% 10.0% 33.2%
13 1993 1094.4 18962.9 1154.1 19997.2 2.5% 2.5% 10.4% 38.0%
14 1994 1160.8 20061.4 1239.7 21425.4 3.8% 2.0% 9.5% 40.8%
15 1995 1215.1 20941.6 1341.9 23125.9 2.5% 2.6% 8.6% 43.7%
16 1996 1268.2 21804.1 1416.8 24359.4 2.5% 2.4% 8.1% 43.8%
17 1997 1354.3 23223.7 1558.5 26726.5 5.0% 1.8% 7.0% 43.3%
18 1998 1420.3 24288.9 1651.8 28247.5 3.7% 1.6% 6.3% 40.9%
19 1999 1487.8 25352.1 1682.6 28672.0 3.3% 1.3% 6.0% 39.4%
20 2000 1574.8 26743.5 1661.3 28212.7 3.5% 0.8% 5.5% 36.7%
21 2001 1654.3 27984.7 1639.1 27727.9 2.7% 1.2% 5.1% 34.0%
22 2002 1716.7 28916.4 1782.9 30032.2 2.2% 1.3% 5.2% 34.2%
23 2003 1808.7 30328.2 2052.8 34421.4 3.3% 1.4% 5.0% 35.4%
24 2004 1899.7 31688.0 2413.1 40252.4 2.3% 1.3% 4.8% 38.4%
25 2005 2017.2 33389.8 2535.6 41971.8 3.0% 2.1% 4.8% 39.6%
26 2006 2135.4 35106.7 2709.8 44549.7 2.7% 2.3% 5.4% 40.5%
27 2007 2244.9 36610.5 3094.6 50467.1 2.4% 2.3% 5.4% 41.5%
28 2008 2281.6 36904.6 2952.3 47753.7 -0.3% 3.6% 5.7% 49.3%
29 2009 2201.7 35363.2 2421.0 38885.1 -4.1% 2.2% 7.6% 63.2%
30 2010 2274.4 36240.1 2484.0 39579.6 2.1% 3.3% 7.9% 74.3%
31 2011 2351.3 37154.2 2660.8 42044.6 1.3% 4.5% 8.1% 80.0%
32 2012 2439.7 38296.7 2704.5 42453.5 1.4% 2.8% 8.0% 83.2%
33 2013 2560.7 39945.1 2785.1 43444.6 2.2% 2.6% 7.6% 84.2%
34 2014 2667.4 41292.5 3067.1 47480.8 2.9% 1.5% 6.2% 86.1%
35 2015 2772.6 42583.1 2933.4 45053.5 2.4% 0.0% 5.4% 86.7%
36 2016 2896.5 44121.1 2703.2 41177.8 1.7% 0.7% 4.9% 86.8%
37 2017 3032.8 45923.4 2664.7 40349.9 1.7% 2.7% 4.4% 86.3%
38 2018 3144.1 47325.6 2861.0 43063.7 1.3% 2.5% 4.1% 85.8%
39 2019 3246.3 48599.0 2833.3 42416.6 1.4% 1.8% 3.8% 85.2%
40 2020 2961.9 44153.9 2709.7 40394.1 -9.8% 0.9% 4.5% 104.5%
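To illustrate the mechanics without depending on Wikipedia's live markup, here's a small sketch where an HTML snippet stands in for a full web page (note that parsing requires lxml or html5lib/bs4 to be installed):

```python
import io

import pandas as pd

# read_html returns a list of DataFrames, one per <table> found in
# the document. A tiny HTML snippet stands in for a web page here so
# the example runs offline.
html = """
<table>
  <tr><th>Year</th><th>GDP</th></tr>
  <tr><td>2019</td><td>2833.3</td></tr>
  <tr><td>2020</td><td>2709.7</td></tr>
</table>
"""
tables = pd.read_html(io.StringIO(html))
print(len(tables))      # 1
print(tables[0].shape)  # (2, 2)
```

When a page has dozens of tables, the match parameter can narrow the result to only those tables containing a given string, which saves you hunting through the list by index.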
3. Normalising JSON
Sometimes when you are working with JSON data, you’ll find it’s quite common for a JSON object to contain many nested objects. For a JSON object to be converted into tabular data, it needs to be flattened (i.e. have a nested depth of one).
Conveniently, pandas provides a way to do this using the pd.json_normalize()
method.
For example, consider the following JSON object (as used as an example in the pandas documentation)
>>> data = [
{
"id": 1,
"name": "Cole Volk",
"fitness": {"height": 130, "weight": 60},
},
{"name": "Mark Reg", "fitness": {"height": 130, "weight": 60}},
{
"id": 2,
"name": "Faye Raker",
"fitness": {"height": 130, "weight": 60},
},
]
It features attributes which contain nested objects (“fitness” in this case). Using pd.json_normalize()
, this can be collapsed into a table. Note: the optional max_level
parameter can be used to specify the maximum number of nested levels to collapse. By default, all levels are normalised.
>>> df = pd.json_normalize(data)
>>> df
id name fitness.height fitness.weight
0 1.0 Cole Volk 130 60
1 NaN Mark Reg 130 60
2 2.0 Faye Raker 130 60
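The max_level behaviour mentioned above can be sketched as follows: with max_level=0, nested objects are left intact rather than expanded into dotted columns.

```python
import pandas as pd

data = [
    {"name": "Cole Volk", "fitness": {"height": 130, "weight": 60}},
]

# max_level=0 stops flattening at the top level, so "fitness" stays
# a single column holding the nested dict instead of being split
# into fitness.height and fitness.weight.
df = pd.json_normalize(data, max_level=0)
print(df.columns.tolist())  # ['name', 'fitness']
```

This is handy when you want to keep some nested structure around, for example to normalise it separately later.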
4. Import from Clipboard
This is a rather simple one, but let’s suppose you’ve copied some tabulated data from a website or a document. The pd.read_clipboard()
method can be used to capture whatever data is stored on your clipboard.
By default, this method uses the regular expression \s+
as the delimiter for separating values. Alternatively, you can supply your own regular expression.
>>> df = pd.read_clipboard(sep='\\s+')
This is especially handy as it saves you from typing the data in by hand, making you far less prone to transcription mistakes.
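Under the hood, read_clipboard essentially feeds the clipboard text to read_csv. Since clipboard access isn't available everywhere (e.g. on headless machines), here is a rough offline equivalent, with io.StringIO standing in for the clipboard contents:

```python
import io

import pandas as pd

# Simulated clipboard contents: whitespace-separated columns, as you
# might copy from a web page or a document.
pasted = """name  score
alice  10
bob    12
"""

# read_clipboard(sep=r'\s+') behaves roughly like read_csv applied
# to the clipboard text with the same separator.
df = pd.read_csv(io.StringIO(pasted), sep=r"\s+")
print(df.shape)  # (2, 2)
```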
5. Import from Excel
Let’s be honest, everyone has used Microsoft Excel (other spreadsheet software is available) at some point in their lives. If I’m honest, I still use it a lot, and it remains one of the most popular ways of sharing tabulated data.
Let’s suppose someone has sent you some data stored in an Excel spreadsheet. The pd.read_excel()
method can be used to read it just as if it were a simple CSV file.
One thing to bear in mind is that spreadsheet software lets users split data across multiple sheets within a single file. Data from a specific sheet can be accessed using the sheet_name
parameter.
>>> df = pd.read_excel('my_spreadsheet.xlsx', sheet_name='Sheet 1')
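A round-trip sketch of the sheet_name behaviour, assuming the openpyxl engine is installed for .xlsx support (the file and sheet names here are made up for illustration):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Write a two-sheet workbook to a temporary file, then read one
# sheet back. Requires openpyxl for the .xlsx format.
path = Path(tempfile.mkdtemp()) / "demo.xlsx"
with pd.ExcelWriter(path) as writer:
    pd.DataFrame({"a": [1, 2]}).to_excel(writer, sheet_name="First", index=False)
    pd.DataFrame({"b": [3, 4]}).to_excel(writer, sheet_name="Second", index=False)

# sheet_name selects a single sheet; passing sheet_name=None instead
# returns a dict mapping every sheet name to its DataFrame.
df = pd.read_excel(path, sheet_name="Second")
print(df.columns.tolist())  # ['b']
```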
Conclusions
In summary, this blog post features some of the most useful pandas shortcuts I use daily. Let me know if I’m missing any obvious ones, as I’m sure there are many more than those listed here. Overall, pandas is an incredibly useful Python package with so many different features for reading, writing and manipulating data.