What is the difference between Scrapy, Selenium, and Beautiful Soup?
One of the most common questions I see when someone first starts learning Scrapy is: what is the difference between Scrapy, Selenium, and Beautiful Soup? There are a lot of resources on the web that try to answer this question, but many of them are unclear or simply don't make sense. So I decided to write an article explaining what they are, why they're different, and how they're different.
What is Scrapy Selenium?
Scrapy Selenium means using Scrapy together with the Selenium library to automate your scraping projects. Selenium is not provided by Scrapy: it is a separate browser automation project with Python bindings, and it is usually plugged into Scrapy through a third-party middleware such as scrapy-selenium. Scrapy on its own sends HTTP requests and parses the responses, Selenium drives a real browser (which matters for pages that build their content with JavaScript), and Beautiful Soup only parses HTML that something else has already downloaded. In other words, if you want to scrape something, Scrapy Selenium allows you to click, type, send some text, get the response, and save it to a file.
How do you start using it?
>>> from scrapy.selector import Selector  # older Scrapy releases called this HtmlXPathSelector
>>> from selenium import webdriver
>>> from selenium.webdriver.common.keys import Keys
>>> from selenium.webdriver.common.action_chains import ActionChains
>>> from selenium.webdriver.support.ui import Select
>>> from selenium.webdriver.support.ui import WebDriverWait
>>> from selenium.webdriver.common.by import By
>>> from selenium.webdriver.support import expected_conditions as EC
>>> from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
>>> from selenium.webdriver.remote.webdriver import WebDriver
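The "click, type, send some text, get the response, and save it to a file" workflow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `search_and_save` is a hypothetical helper name, the form field is assumed to be called `q`, and the function only relies on the `get`, `find_element`, `send_keys`, `submit`, and `page_source` members that a Selenium WebDriver exposes.

```python
from pathlib import Path


def search_and_save(driver, url, query, out_path, field_name="q"):
    """Load a page, type into a form field, submit it, and save the HTML.

    `driver` is expected to behave like a Selenium WebDriver.
    `field_name` is an assumed form field name, not taken from any real site.
    """
    driver.get(url)                                # open the page in the browser
    box = driver.find_element("name", field_name)  # "name" is the value of By.NAME
    box.clear()                                    # remove any pre-filled text
    box.send_keys(query)                           # type the text
    box.submit()                                   # submit the enclosing form
    # A real script would usually wait here (for example with WebDriverWait)
    # until the result page has loaded before reading the source.
    html = driver.page_source
    Path(out_path).write_text(html, encoding="utf-8")
    return len(html)
```

With a real browser you would pass in something like `webdriver.Chrome()` and call `driver.quit()` when finished. The helper takes the driver as an argument so the same code can also be exercised with a stand-in object.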
Is Selenium best for web scraping?
Selenium is one of the most widely used tools for browser-driven web scraping, and it has a good reputation among programmers for being easy to understand. It is a browser automation library rather than a programming language; you script it from a language such as Python. It's easy to pick up and quick to get started with. Selenium lets you launch a web browser and navigate through pages, looking for the links and data you wish to obtain.
My concern with Selenium was that we would need a framework like PhantomJS or CasperJS on top of it. That is not the case. Selenium talks to a browser through a driver (for example ChromeDriver for Chrome or geckodriver for Firefox), so nobody has to write their own version of CasperJS or PhantomJS in order to use it. PhantomJS is a headless browser that is no longer maintained, and CasperJS is a scripting utility built on top of PhantomJS, not the same thing.
Would it not be better to use just PHP for web scraping? I would guess yes. If you're creating a scraper, wouldn't you want to do it without a framework? If you really needed the structure and ease of use of a framework, you could write your own PHP-based one, but it seems this was not your intent.
In other words, why use more when you can do less? The other concern is that scraping a site using only PHP may not work very well. Some websites (the ones that require registration) have forms that send you back to the same page when you reload it, so you won't see the real content until you have registered and logged in. With PHP alone there is no browser to hold the login for you, so you'll probably have to juggle a couple of separate sessions, and keep their cookies yourself, to scrape such a page successfully.
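To make the login problem concrete, here is a minimal sketch of what "keeping the session yourself" looks like without a browser. It uses Python's standard library rather than PHP, and the function names, URLs, and form field names are hypothetical placeholders, not taken from any particular site.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login_then_fetch(opener, login_url, form_fields, target_url):
    """POST a login form, then GET a protected page with the same opener.

    `login_url`, `form_fields`, and `target_url` are supplied by the caller;
    the field names a given site expects are an assumption you must check.
    """
    body = urllib.parse.urlencode(form_fields).encode("utf-8")
    with opener.open(login_url, data=body) as response:
        response.read()  # any session cookie from the response is stored in the jar
    with opener.open(target_url) as response:
        # The stored cookies are sent back automatically on this request.
        return response.read().decode("utf-8", errors="replace")
```

The point of the sketch is that every request has to go through the same opener. A fresh opener starts with an empty cookie jar, which is why a logged-out script keeps landing back on the login form.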
Using Selenium, you could go through each page in one browser session: once you are logged in, you can switch to a new page and begin scraping it immediately.
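A rough sketch of that single-session idea is shown below. `collect_titles` is a hypothetical helper, and it assumes only the `get` method and `title` attribute that a Selenium WebDriver provides. Recording the page title stands in for whatever data you actually want to extract.

```python
def collect_titles(driver, urls):
    """Visit each URL with the same browser session and record the page title."""
    titles = {}
    for url in urls:
        driver.get(url)             # same window, same cookies, same login state
        titles[url] = driver.title  # read data from the page that just loaded
    return titles
```

Because the driver object is reused for every URL, cookies set by an earlier page (such as a login) are still present when the later pages are requested.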