There is an abundance of freely available datasets on the internet for pretty much anything and everything you can think of. In a previous blog post, we covered five different platforms for finding useful datasets.
However, as mentioned in that post, there are times when no central dataset or endpoint is available and you need to crawl and scrape the information from webpages yourself. In this case, web scraping is a sensible solution.
Note: For the record, by web scraping/crawling I refer to the specific task of performing an HTTP request for a specific webpage and parsing the HTML document for the exact elements you need.
While web scraping itself is a perfectly legal and safe thing to do, there are a few considerations needed to ensure that you are being respectful and considerate to the people who host the website.
This blog post covers some of the most important considerations to make when building your own web scraper/crawler.
1. Check robots.txt

robots.txt is a small text file placed at the root directory of a web server which provides a set of rules for what web crawlers can and cannot crawl. Sometimes, the creator of a website may set rules allowing specific web crawlers access to some pages while blocking access to others.
For example, the crawler Google uses to index web pages for its search engine follows the rules defined in this file. Web developers can specify which pages they do not wish Google to crawl.
The robots.txt file applies to all web crawlers, so it is important to check that the page you would like to scrape is not disallowed in robots.txt; otherwise, you are not permitted to crawl it.
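As a sketch, Python's standard library includes urllib.robotparser for checking these rules (the rules and the "MyScraper" agent name below are made up for illustration; in practice you would point the parser at the site's live robots.txt with set_url() and read()):

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt, parsed inline for illustration.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# Check whether our crawler may fetch a given URL before requesting it.
print(parser.can_fetch("MyScraper", "https://example.com/articles/1"))   # True
print(parser.can_fetch("MyScraper", "https://example.com/private/data")) # False
```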
2. Don’t hammer with requests
If you’re regularly scraping data from the same website, you may wish to reduce the number of requests you make within a short period. One or two requests are fine, but if you spam a server with thousands of requests per minute, the server could very well go offline or, even worse, your IP address could be blocked, making it impossible for you to scrape in future.
Bear in mind that you are not the only person using the website: the server has to handle many requests at once, and if it goes offline, other users miss out too. So think carefully about how many requests you make.
A way to get around this would be to scrape in small bursts. Python's standard library, for example, provides the time.sleep() function, which can be inserted into your code to momentarily pause the web scraper for a given period.
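The idea can be sketched with a small helper (the polite_fetch name and the delay value are illustrative, not from any particular library):

```python
import time

def polite_fetch(urls, fetch, delay=2.0):
    """Fetch each URL in turn, pausing between requests to avoid
    hammering the server. `fetch` is any callable that takes a URL,
    e.g. lambda url: requests.get(url).text."""
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(delay)  # be polite: wait before the next request
    return results
```

A two-second delay is a reasonable starting point, but the right value depends on the site; some publish a Crawl-delay directive in robots.txt.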
3. Consider using a set of proxies
Another way to be considerate when web scraping would be to consider using a proxy. This can be easily done using software such as proxychains or Tor whereby your outgoing traffic passes through multiple proxies and, in the case of Tor, multiple layers of encryption.
The main reason for doing this is it makes it harder for your request to be traced such that your privacy is maintained. This is particularly important if you are dealing with sensitive data or if you do not want your true identity to be revealed.
With this in mind, please don’t use this solution as a means to cause criminal damage. This is not for encouraging criminal activity but serves as another option to consider when scraping data.
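As one possible setup, the requests library accepts a proxies mapping per request. The sketch below assumes a local Tor SOCKS proxy listening on port 9050 and the optional requests[socks] (PySocks) dependency installed:

```python
# A minimal sketch: routing traffic through a local Tor SOCKS proxy.
# Assumes Tor is running on 127.0.0.1:9050 and requests[socks] is installed.
proxies = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

# import requests
# response = requests.get("https://example.com", proxies=proxies, timeout=30)
```

The socks5h scheme (rather than socks5) asks the proxy to resolve DNS as well, so lookups don't leak outside the tunnel.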
4. Alternating user agents
Depending on what software you use, you can define a custom user agent as an HTTP request header so that your request resembles one from a real web browser when the server receives it.
Typically, the User-Agent header informs a web server of the web browser, operating system and vendor you are using when you send the request.
Much like proxies, by using an alternative (or even fake) user agent, you can keep yourself anonymous but more importantly, you can convince the web server that you are an actual web browser like Chrome or Firefox.
This also has the added benefit of potentially giving you access to content which is only served to web browsers, and it can be done easily using packages such as requests in Python.
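A minimal sketch of rotating user agents with requests (the user-agent strings and the random_headers helper below are illustrative):

```python
import random

# A small pool of browser-style user-agent strings (versions are illustrative).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def random_headers():
    """Build request headers with a randomly chosen user agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}

# Usage with requests:
# response = requests.get("https://example.com", headers=random_headers())
```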
5. Use cached data
If you find yourself scraping the same data from a website regularly, you may want to consider caching it by storing an offline copy locally.
Caching data locally saves you from making unnecessary requests to the web server: you only need to scrape again when the data has been updated. It also speeds things up, since you don't have to send a request every time you want to analyse the data.
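One way to sketch this is a simple time-based file cache (cached_fetch, the TTL and the cache path are illustrative, not from the post):

```python
import os
import time

CACHE_TTL = 60 * 60  # re-scrape at most once an hour (illustrative)

def cached_fetch(url, fetch, cache_path, ttl=CACHE_TTL):
    """Return locally cached content if it is still fresh; otherwise
    call `fetch` (e.g. lambda url: requests.get(url).text) and store
    the result on disk for next time."""
    if os.path.exists(cache_path):
        age = time.time() - os.path.getmtime(cache_path)
        if age < ttl:
            with open(cache_path, "r", encoding="utf-8") as f:
                return f.read()
    content = fetch(url)  # only hit the server when the cache is stale
    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(content)
    return content
```

For sites that support it, HTTP caching headers such as ETag or Last-Modified are a more precise way to detect updates than a fixed TTL.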
6. Consider using an API
Finally, before even considering scraping a website, it's worth checking whether an API is available. An API makes collecting and processing data much easier and saves you the job of scraping it manually. It may not always be obvious, but there are occasions where a website provides one.
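For illustration, consuming a JSON API is usually just a matter of parsing a structured response rather than picking elements out of HTML; the endpoint and response schema below are entirely hypothetical:

```python
import json

def parse_posts(payload):
    """Extract titles from a JSON API response body (hypothetical schema)."""
    return [post["title"] for post in json.loads(payload)["posts"]]

# In practice the payload would come from something like:
# payload = requests.get("https://example.com/api/posts").text
sample = '{"posts": [{"title": "First"}, {"title": "Second"}]}'
print(parse_posts(sample))  # ['First', 'Second']
```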
This blog post has covered some of the most important things to consider when building a web scraper to collect data from the web. As mentioned before, there are many occasions when a web scraper is needed to collect data and make it more accessible to others.
The reality is that 99% of the time, web scraping will cause no damage (especially if you follow the points in this blog post). However, there is always that 1% risk that people may not be happy with what you're doing. It's ultimately up to you to decide which points to apply, but remember to be civil and to respect the people behind the website.