
I'm trying to web scrape this web page and all the "next pages" of this search

http://g1.globo.com/busca/?q=incerteza+pol%C3%ADtica+economia&cat=a&ss=4da73052cb8296b5&st=G1&species=not%C3%ADcias

When I go to page 2 of the search, I correctly extract all the links.

When I go to a page that doesn't exist, the website redirects me to the first page of the search.

http://g1.globo.com/busca/?q=incerteza+pol%C3%ADtica+economia&cat=a&ss=4da73052cb8296b5&st=G1&species=not%C3%ADcias&page=5000

For example, if I go to page 2500, I don't get an error (which is what I would want); I am simply sent back to the first page.
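One quick way to confirm that behaviour while sticking with urlopen: the response object's geturl() returns the final URL after any redirects, so comparing it with the URL you requested shows whether you were bounced back. A minimal sketch (using the out-of-range page URL from above):

    from urllib.request import urlopen

    requested = ("http://g1.globo.com/busca/?q=incerteza+pol%C3%ADtica+economia"
                 "&cat=a&ss=4da73052cb8296b5&st=G1&species=not%C3%ADcias&page=5000")
    response = urlopen(requested)  # urlopen follows redirects silently
    if response.geturl() != requested:
        print("redirected to:", response.geturl())  # i.e. back to the first page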

Here is a piece of my code:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    try:
        html = urlopen("http://g1.globo.com/busca/?q=economia&cat=a&ss=1885518dc528dd9b&st=G1&species=not%C3%ADcias&page=110")  # search link
        bsObj = BeautifulSoup(html, "html.parser")  # parse the response
        print(bsObj)
    except OSError:
        print("test")

My objective is to scrape all the available pages and stop the code after that. To do that, I first need to understand what's going on.

Thanks

2 Answers


When you reach the last page, the button gets disabled:

 <a data-pagina="2" href="?ss=4da73052cb8296b5&amp;st=G1&amp;q=incerteza+pol%C3%ADtica+economia&amp;cat=a&amp;species=not%C3%ADcias&amp;page=2"
 class="proximo fundo-cor-produto"> próximo</a>
             ^^^^
             # ok

 <a data-pagina="41" href="?ss=4da73052cb8296b5&amp;st=G1&amp;q=incerteza+pol%C3%ADtica+economia&amp;cat=a&amp;species=not%C3%ADcias&amp;page=41"
     class="proximo disabled">próximo</>
             ^^^^
            # no more next pages

So just keep looping until then:

    from bs4 import BeautifulSoup
    import requests
    from itertools import count

    url = ("http://g1.globo.com/busca/?q=incerteza+pol%C3%ADtica+economia"
           "&cat=a&ss=4da73052cb8296b5&st=G1&species=not%C3%ADcias&page={}")

    page_count = count(1)  # page numbers 1, 2, 3, ...
    soup = BeautifulSoup(requests.get(url.format(next(page_count))).content, "html.parser")
    disabled = soup.select_one("#paginador ul li a.proximo.disabled")
    print([a["href"] for a in soup.select("div.busca-materia-padrao a")])
    while not disabled:
        soup = BeautifulSoup(requests.get(url.format(next(page_count))).content, "html.parser")
        disabled = soup.select_one("#paginador ul li a.proximo.disabled")
        print([a["href"] for a in soup.select("div.busca-materia-padrao a")])

If you were using requests and wanted to check whether you had been redirected, you could access the .history attribute:

    In [1]: import requests

    In [2]: r = requests.get("http://g1.globo.com/busca/?q=incerteza%20pol%C3%ADtica%20economia&cat=a&ss=4da73052cb8296b5&st=G1&species=not%C3%ADcias&page=5000")

    In [3]: print(r.history)
    [<Response [301]>]

    In [4]: r.history[0].status_code == 301
    Out[4]: True
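For example, a minimal sketch that leans on .history to stop at the redirect (assuming in-range pages are served without one, and reusing the url template and selector from above):

    import requests
    from bs4 import BeautifulSoup
    from itertools import count

    url = ("http://g1.globo.com/busca/?q=incerteza+pol%C3%ADtica+economia"
           "&cat=a&ss=4da73052cb8296b5&st=G1&species=not%C3%ADcias&page={}")

    for page in count(1):
        r = requests.get(url.format(page))
        if r.history:  # non-empty history means we were redirected back to page 1
            break
        soup = BeautifulSoup(r.content, "html.parser")
        print([a["href"] for a in soup.select("div.busca-materia-padrao a")])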

Another way using requests would be to disallow redirects and catch a 301 return code.

    from bs4 import BeautifulSoup
    import requests
    from itertools import count

    url = ("http://g1.globo.com/busca/?q=incerteza+pol%C3%ADtica+economia"
           "&cat=a&ss=4da73052cb8296b5&st=G1&species=not%C3%ADcias&page={}")

    page_count = count(1)
    soup = BeautifulSoup(requests.get(url.format(next(page_count))).content, "html.parser")
    print([a["href"] for a in soup.select("div.busca-materia-padrao a")])

    while True:
        r = requests.get(url.format(next(page_count)), allow_redirects=False)
        if r.status_code == 301:  # the redirect itself signals the last page
            break
        soup = BeautifulSoup(r.content, "html.parser")
        print([a["href"] for a in soup.select("div.busca-materia-padrao a")])
  • I think your logic is correct, but the while condition doesn't stop the code when it reaches the last page. Oct 1, 2016 at 11:46
  • @ThalesMarques, yes, I had a typo in my selector, it will work fine now Oct 1, 2016 at 12:55
  • The second snippet keeps looping after the last page, but the last one works fine. I will work on something like it. Thank you very much! Oct 1, 2016 at 16:31

You could always store a hash of the response from the first page (assuming it's actually identical each time) and then check whether the response of each subsequent page matches that hash.
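A minimal sketch of that idea, assuming the redirect response really is byte-for-byte identical to page 1 (any dynamic markup, ads, or timestamps would break the comparison):

    import hashlib
    import requests

    url = ("http://g1.globo.com/busca/?q=incerteza+pol%C3%ADtica+economia"
           "&cat=a&ss=4da73052cb8296b5&st=G1&species=not%C3%ADcias&page={}")

    first_page_hash = hashlib.sha256(requests.get(url.format(1)).content).hexdigest()

    page = 2
    while True:
        content = requests.get(url.format(page)).content
        if hashlib.sha256(content).hexdigest() == first_page_hash:
            break  # same bytes as page 1, so we were redirected back
        # ... process `content` here ...
        page += 1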

Additionally, you could use urllib2:

    # Python 2 (urllib2 was merged into urllib.request in Python 3)
    import urllib2
    from bs4 import BeautifulSoup

    opener = urllib2.build_opener()
    urllib2.install_opener(opener)
    try:
        response = urllib2.urlopen('http://g1.globo.com/busca/?q=incerteza+pol%C3%ADtica+economia&cat=a&ss=4da73052cb8296b5&st=G1&species=not%C3%ADcias&page=5000')
        bsObj = BeautifulSoup(response.read(), "html.parser")  # parse the response
        print(bsObj)
    except urllib2.HTTPError, err:
        if err.code == 404:
            print "Page not found!"
