Are You Wasting Your Time Scraping the Web?

Using synchronous code in Python…

Is your Python web scraping code taking a very long time to download content? Are you looking for ways to optimize it? You might think of multiprocessing or multithreading, but the most useful technique in such scenarios is asynchronous code.

Let me explain the difference between synchronous and asynchronous code so that you can better understand how asynchronous code makes better use of time. Let us assume a scenario where data is scraped from multiple pages using the requests library in Python.

As you can see in the fig. above, each task sends a request to the server and then has to wait for the response. During this wait the CPU sits idle until the response is received; only then can the next task make its request. In asynchronous code this is not the case: asynchronous code takes advantage of the idle time between sending a request and receiving its response. This is shown in the fig. below.


As we can see, in asynchronous code the individual tasks do not wait for the response of the task ahead of them. The tasks fire their requests one after the other without waiting. Once all the tasks have made their requests, the server starts sending responses back as they become ready. The responses are not received in the order in which the requests were sent; rather, each arrives whenever it is ready. Hence in the fig. the order is marked "ND" (order not defined). As a programmer, however, you do not have to think about the order in which responses are received: asyncio.gather returns the results in the same order in which the tasks were submitted. You can see a drastic difference in execution time between synchronous and asynchronous code when the number of tasks is very large.
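To make that ordering guarantee concrete, here is a minimal sketch (not part of the original demo) in which the later tasks deliberately finish first; asyncio.gather still hands back the results in the order the tasks were submitted:

import asyncio

async def fake_fetch(i):
    # Simulate responses becoming ready out of order: higher i finishes sooner.
    await asyncio.sleep(0.1 * (5 - i))
    return i

async def main():
    results = await asyncio.gather(*(fake_fetch(i) for i in range(5)))
    print(results)  # prints [0, 1, 2, 3, 4]: submission order, not completion order

asyncio.run(main())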

Python has support for asynchronous web scraping, which can be achieved using the asyncio and aiohttp libraries. Let us see a practical demo of synchronous vs. asynchronous code in Python.

The use case is to scrape data from multiple URLs.

Synchronous Python code using requests:

import requests
import time

def scrape():
    text_list = []
    for number in range(500):
        url = f"url_to_scrape_{number}"
        resp = requests.get(url)  # blocks until the full response arrives
        resp_text = resp.text
        text_list.append(resp_text)  # append mutates the list in place
    return text_list

if __name__ == "__main__":
    start = time.time()
    text_list = scrape()
    print(f"time required : {time.time() - start} seconds")

Asynchronous Python code using asyncio and aiohttp:

import asyncio
import aiohttp
import time

async def get_content(session, url):
    # Awaiting the response yields control so other requests can proceed.
    async with session.get(url) as resp:
        text = await resp.text()
        return text

async def scrape():
    async with aiohttp.ClientSession() as session:
        tasks = []
        for number in range(500):
            url = f"url_to_scrape_{number}"
            tasks.append(get_content(session, url))
        # gather runs all tasks concurrently and returns results in submission order.
        text_list = await asyncio.gather(*tasks)
        return text_list

if __name__ == "__main__":
    start = time.time()
    text_list = asyncio.run(scrape())
    print(f"time required : {time.time() - start} seconds")

You can use the asyncio code above as a template and modify it according to your own needs. Try executing both of the snippets above (just remember to change the URL and handle the response as needed) for a large number of URLs, and you will notice a large difference in execution time between the two.
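One common modification, sketched here only as an example and not part of the original template, is to cap how many requests are in flight at once with asyncio.Semaphore, so that firing 500 simultaneous requests does not overwhelm the target server. The limit of 50 below is an assumed value; tune it to your target:

import asyncio
import aiohttp

MAX_CONCURRENT = 50  # assumed limit; adjust to what the server tolerates

async def get_content(session, semaphore, url):
    async with semaphore:  # at most MAX_CONCURRENT requests run at a time
        async with session.get(url) as resp:
            return await resp.text()

async def scrape(urls):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    async with aiohttp.ClientSession() as session:
        tasks = [get_content(session, semaphore, url) for url in urls]
        return await asyncio.gather(*tasks)

Everything else in the template stays the same; you would call asyncio.run(scrape(urls)) with your own list of URLs.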

You can find the asyncio documentation at https://docs.python.org/3/library/asyncio.html