In Python and Trio, when consumers also act as producers, how can everything be shut down gracefully once the work runs out?

My goal is to create a basic web crawler using trio and asks. I am using a nursery to launch multiple crawlers simultaneously, and a memory channel to hold the URLs still to be visited.

Each crawler receives clones of both ends of the channel, so it can retrieve a URL (via receive_channel), read its content, discover new URLs to visit, and add them back to the channel (via send_channel).

import trio

async def main():
    send_channel, receive_channel = trio.open_memory_channel(0)
    async with trio.open_nursery() as nursery:
        async with send_channel, receive_channel:
            nursery.start_soon(crawler, send_channel.clone(), receive_channel.clone())
            nursery.start_soon(crawler, send_channel.clone(), receive_channel.clone())
            nursery.start_soon(crawler, send_channel.clone(), receive_channel.clone())


async def crawler(send_channel, receive_channel):
    async for url in receive_channel:  # I'm a consumer!
        content = await ...
        urls_found = ...
        for u in urls_found:
            await send_channel.send(u)  # I'm a producer too!

In this scenario, the consumers actually act as producers. How can everything be stopped gracefully?

The criteria for shutting down all the tasks are:

  • The channel is empty,
  • AND
  • All crawlers are blocked at the top of their for loop, waiting for a URL that will never arrive on receive_channel.

I have tried using async with send_channel inside crawler(), but have not found a working solution. Other approaches, such as a worker pool bound to the memory channel, have also been unsuccessful.

Answer №1

It appears that there are a couple of issues here.

Firstly, the assumption about stopping when the channel is empty is problematic. Since the memory channel is allocated with a capacity of 0, it is always empty: a URL can only be handed off when a crawler is already waiting to receive it.

This leads to the second issue. If you happen to discover more URLs than the number of crawlers you have allocated, the application will end up deadlocked.

The reason for this is that if you cannot pass off all the found URLs to a crawler, the crawler will never be ready to receive a new URL to crawl because it is stuck waiting for another crawler to take one of its URLs.

Furthermore, any other crawlers that discover new URLs will also get stuck behind the crawler that is already waiting to hand off its URLs, and none of the queued URLs will ever be processed.

The Trio documentation's section on buffering in channels gives additional context on this behavior.

Assuming these issues are addressed, where should we focus on next?

It might be necessary to maintain a list or set of visited URLs to prevent revisiting them.
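A minimal, trio-free sketch of that bookkeeping (the function name and URLs are illustrative, not from the original code):

```python
def new_links(urls_found, visited):
    """Filter out URLs we have already queued, and mark the rest as visited."""
    fresh = [u for u in urls_found if u not in visited]
    visited.update(fresh)
    return fresh

visited = {"https://example.com"}
print(new_links(["https://example.com", "https://example.com/a"], visited))
# → ['https://example.com/a']
```

Calling new_links before sending ensures each URL is enqueued at most once, which is also what guarantees the crawl eventually runs out of work.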

To determine when to halt the process, instead of closing the channels, it could be simpler to cancel the nursery altogether.

If we alter the main loop like so:

async def main():
    send_channel, receive_channel = trio.open_memory_channel(math.inf)
    active_workers = trio.CapacityLimiter(3) # Number of workers
    async with trio.open_nursery() as nursery:
        async with send_channel, receive_channel:
            nursery.start_soon(crawler, active_workers, send_channel, receive_channel)
            nursery.start_soon(crawler, active_workers, send_channel, receive_channel)
            nursery.start_soon(crawler, active_workers, send_channel, receive_channel)
            while True:
                await trio.sleep(1) # Allow the workers to initialize.
                if active_workers.borrowed_tokens == 0 and send_channel.statistics().current_buffer_used == 0:
                    nursery.cancel_scope.cancel() # All tasks completed!

We must then adjust the crawler functions slightly to ensure they consume tokens appropriately.

async def crawler(active_workers, send_channel, receive_channel):
    async for url in receive_channel:  # Acting as a consumer!
        async with active_workers:  # CapacityLimiter is an async context manager
            content = await ...
            urls_found = ...
            for u in urls_found:
                await send_channel.send(u)  # Also producing!

Other factors to take into account -

In the crawler function, using send_channel.send_nowait(u) may be advisable. Because the buffer is unbounded, there is no risk of a trio.WouldBlock exception, and skipping the checkpoint ensures a given URL is fully processed and all of its new URLs are enqueued before any other task, or the parent task, gets to run.

Answer №2

After spending some time reorganizing the problem, I devised a solution that works like this:

async def main():
    send_channel, receive_channel = trio.open_memory_channel(math.inf)
 
    limit = trio.CapacityLimiter(3)

    async with send_channel:
        await send_channel.send(('https://start-url', send_channel.clone()))
    #HERE1

    async with trio.open_nursery() as nursery:
        async for url, send_channel in receive_channel:  #HERE3
            nursery.start_soon(crawler, url, send_channel, limit)

async def crawler(url, send_channel, limit):
    async with limit, send_channel:
        content = await ...
        links = ...
        for link in links:
            await send_channel.send((link, send_channel.clone()))
    #HERE2

In the process, I omitted handling visited URLs.

This approach allows up to three consumers to be active at any given time, workload permitting. At #HERE1 the primary send_channel is closed, since it was opened in a context manager; from that point on, the channel is kept alive only by the clone travelling inside it.

When we reach #HERE2, the clone is also closed due to the context manager. If there are no more items in the channel, then the last remaining clone keeping it alive will also disappear. This results in the end of the for loop (#HERE3).

However, if new URLs are discovered during processing, they are added back to the channel along with additional clones of send_channel, ensuring the channel's longevity until all tasks are complete.

I find both my solution and Anders E. Andersen’s methods somewhat unconventional: one involves using sleep and statistics(), while the other relies on creating clones of send_channel within the channel itself—a bit like a software rendition of a Klein bottle. I may explore alternative strategies moving forward.
