
Does Twitter ban scraping?
I wrote a script that uses the Twitter API to get the recent tweets of some selected users.
I was looking for an easy way to use the Twitter REST API to get the same info, but it seems that Twitter blocks scraping by adding an extra header to the response to my request.
So what is the best way to do this? Do you have an example of how to get the 100 most recent tweets from a given user?
Thanks for the answer. But is it possible to make a request and get the response as XML instead of JSON? As far as I know, JSON is easy to parse, while the XML responses differ in a few ways that make them harder to parse.
I guess it is possible to add the header to a GET request. But it would be rather difficult to get the whole XML file with the headers added (which I would need in order to access the full XML response).
This may or may not be true, depending on how Twitter implements the API. But I have been able to use the API to find what you're after. This is what I did: make a request to get your profile. The response will have the necessary information.
Then make another request to get your most recent messages. Save the text of each message to a file; you can then read the file for the contents. You can also use the -u option to get only the date of the message.
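The two requests described above could be sketched roughly as follows. This is only an illustration, not the answerer's actual script: it assumes the Twitter API v2 endpoints `/2/users/by/username/:username` and `/2/users/:id/tweets`, a bearer token you already hold, and access that permits these calls. The function names (`http_get_json`, `recent_tweets`, `save_tweets`) are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Twitter API v2 base URL; your access level may differ.
API_ROOT = "https://api.twitter.com/2"


def http_get_json(url, bearer_token):
    """GET a URL with a bearer-token header and decode the JSON body."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {bearer_token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def recent_tweets(username, bearer_token, count=100, get_json=http_get_json):
    """Return the text of up to `count` recent tweets for `username`."""
    # Request 1: look up the profile to obtain the numeric user id.
    profile_url = f"{API_ROOT}/users/by/username/{urllib.parse.quote(username)}"
    user_id = get_json(profile_url, bearer_token)["data"]["id"]

    # Request 2: fetch the most recent tweets for that id.
    # Assumption: the endpoint accepts max_results of at most 100 per call.
    query = urllib.parse.urlencode({"max_results": count})
    tweets_url = f"{API_ROOT}/users/{user_id}/tweets?{query}"
    payload = get_json(tweets_url, bearer_token)
    return [tweet["text"] for tweet in payload.get("data", [])]


def save_tweets(texts, path):
    """Write one tweet per line so the file is easy to read back."""
    with open(path, "w", encoding="utf-8") as handle:
        for text in texts:
            handle.write(text.replace("\n", " ") + "\n")
```

Calling `save_tweets(recent_tweets("some_user", token), "tweets.txt")` would then write one tweet per line. The `get_json` parameter is there so the HTTP layer can be swapped out, for example with a stub when testing without network access.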
Can you scrape Twitter with BeautifulSoup?
This seems like a very common task, but I couldn't find it anywhere on the net.
Do you happen to know how this would be done? It seems to work as follows: given an article, build a URL to the Twitter page and get a set of tweets that start with a given text. The problem is getting at a page that's generated from multiple calls to a single URL. Any tips? Thanks.
For the URLs from which Tweets will be generated, I would try to use Twitter Webhooks. The basic idea would be to add the "Content-Length" HTTP header with a value > 0 so the Tweets generated on the fly won't fail. Let's assume that you need to receive Tweets only on some URLs (e.g. from /twitter/home). You would set up those URLs in your Twitter webhook application by creating a new event for all URLs ending with /twitter/home and adding the "Content-Length" header you specified.
Then, in the callback URL, you'd have an event listener which parses the content and generates some Tweets. This means the content you have cannot be an HTML page (even if it has the proper headers), since it would generate empty Tweets or fail to handle an error.
I don't think that you could scrape (i.e. request a URL and fetch the content) tweets starting with "http://", because it would break Twitter's webhook rules.