Should I use Scrapy or Selenium?
I'm trying to scrape some content from a website.
I'm using Python 3.5.1, with Selenium to open the webpage and Scrapy to do the actual work, but I don't know which of the two to choose. I want the solution that is as fast as possible with minimal memory usage.
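Scrapy is not limited to a single webpage at a time. It is a crawling framework that follows links and can fetch many pages of a site concurrently. Selenium can also visit every page of a site, but it does so by driving a real browser.
Are there any disadvantages to using Scrapy, and why should I use one over the other? The main disadvantage of Scrapy is that it does not run JavaScript, so content that a page builds in the browser is not in the HTML it downloads. Selenium handles those pages because it loads and renders each one in a real browser, but that makes it slower and heavier on memory, whether you scrape one page or a whole site. In my experience, a dedicated scraper is faster than Selenium, so when the data is already present in the raw HTML, Scrapy is the better fit for speed and low memory usage.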
How much does Scrapy cost?
Scrapy costs nothing. It is a free, open-source framework released under the BSD license, so there is nothing to buy. The project website is scrapy.org, and you install Scrapy from the Python Package Index with pip. It is a Python library, not an app for your phone or tablet.
Scrapy is not a new project. It was first released publicly as open-source software in 2008, and it is currently maintained by Zyte (formerly Scrapinghub) together with community contributors. The source code is hosted on GitHub. You can see the repository here.
Here are some sample pages. Try them out for free and then decide whether Scrapy is right for you.
Getting Started. I've had Scrapy running for months. I'm not getting paid for this guide. This is just my feedback on Scrapy.
There are three main things you need to do to get started. Install Scrapy. The easiest way to get Scrapy installed is through the Python Package Index. You can see all the different ways to install Scrapy here.
The easiest way to install Scrapy on Ubuntu or Debian is to use pip. It also pulls in Scrapy's Python dependencies, so in most cases you don't need to install anything else separately.
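With pip, the install itself is one command, python -m pip install scrapy. As a rough way to confirm that it worked, the sketch below asks Python whether it can locate the package. The helper name is_installed is only an illustration, not part of Scrapy or pip.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Prints True once "python -m pip install scrapy" has succeeded
    # in the same environment this script runs in.
    print("scrapy installed:", is_installed("scrapy"))
```

If this prints False right after installing, pip and the interpreter are probably pointing at different environments.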
Configure Scrapy. For this step, you can install a sample project or you can download my Scrapy tutorial repo. The project code is available here. You can use it as your starting point. Or, if you'd like to start from scratch, you can clone the Scrapy tutorial repo to get started.
Is Scrapy better than Beautiful Soup?
If you're anything like me, you don't pay much attention to any new technology until it's really old and well established.
Sometimes that means waiting for someone else to build your dream, but most of the time I just have to do something myself. Case in point: I've used the Python HTML parsing library Beautiful Soup in several projects over the past year, and I only recently started using Scrapy in my projects. The reason? Something I'm always trying to find out is which tool is better, more mature, faster, or more flexible. Well, I can finally say that for crawling the answer is Scrapy, which is a full scraping framework rather than just a parser.
As I'm sure you know, scraping web content (finding resources on a website) requires downloading pages, parsing the HTML, finding elements, following the links on each page, and doing other work as required. I've found Beautiful Soup to be great for quickly parsing a page and extracting data from it, but it only parses: it does not download pages or follow links, so fetching and parsing the page behind each link is left to you.
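To make the parsing side concrete, here is a minimal sketch of what Beautiful Soup does well: pulling data out of HTML you already have. It assumes the bs4 package is installed; the sample HTML and the function name extract_links are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a page that has already been downloaded.
SAMPLE_HTML = """
<html><body>
  <h1>Products</h1>
  <a href="/foo/1">First</a>
  <a href="/foo/2">Second</a>
  <a name="top">No link here</a>
</body></html>
"""


def extract_links(html):
    """Return (text, href) pairs for every anchor tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]


if __name__ == "__main__":
    for text, href in extract_links(SAMPLE_HTML):
        print(text, href)
```

Note that nothing here touches the network. Downloading the page and then visiting /foo/1 and /foo/2 would need extra code, and that is the part Scrapy takes over.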
The Scrapy Framework. Scrapy is a fantastic framework for web crawling with many different options available for different use cases. However, I'll cover the basics here.
The basic way to write your crawler is to implement a spider. A spider is a class with a start_urls attribute that lists the URLs the spider will start scraping from. Alternatively, a start_requests() method can yield the initial requests itself. The actual work is done in the callback that receives each downloaded response, which is parse() by default:
import scrapy

# Create a spider for the website you want to crawl.
# To parse the first page, Scrapy needs a Request object for each start URL.
# The response is passed to the spider's callback, where you can extract
# metadata about the site and yield further requests to keep crawling.
This is what the spider does: it finds all anchor tags on the page that have an href attribute, keeps the links that start with /foo/, /bar/, or /something/, and makes a request to each of these links. This is pretty simple, but if you're interested in learning more, there's a lot of useful information in the Scrapy documentation. I encourage you to check it out.
So, let's try to use it. First, we'll create a spider for all URLs starting with /foo/: from scrapy.