
I'm trying to web scrape this web page and all the "next pages" of this search

http://g1.globo.com/busca/?q=incerteza+pol%C3%ADtica+economia&cat=a&ss=4da73052cb8296b5&st=G1&species=not%C3%ADcias

When I go to page 2 of the search, I correctly extract all the links.

When I go to a page that doesn't exist, the website redirects me to the first page of the search.

http://g1.globo.com/busca/?q=incerteza+pol%C3%ADtica+economia&cat=a&ss=4da73052cb8296b5&st=G1&species=not%C3%ADcias&page=5000

For example, if I go to page 2500, I don't get an error (which is what I would want); I am simply sent back to the first page.
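One quick way to confirm that behaviour while sticking with urlopen: the response object's geturl() returns the final URL after any redirects, so comparing it with the URL you requested shows whether you were bounced back. A minimal sketch (using the out-of-range page URL from above):

    from urllib.request import urlopen

    requested = ("http://g1.globo.com/busca/?q=incerteza+pol%C3%ADtica+economia"
                 "&cat=a&ss=4da73052cb8296b5&st=G1&species=not%C3%ADcias&page=5000")
    response = urlopen(requested)  # urlopen follows redirects silently
    if response.geturl() != requested:
        print("redirected to:", response.geturl())  # i.e. back to the first page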

Here is a piece of my code:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    try:
        html = urlopen("http://g1.globo.com/busca/?q=economia&cat=a&ss=1885518dc528dd9b&st=G1&species=not%C3%ADcias&page=110")  # search link
        bsObj = BeautifulSoup(html, "html.parser")  # parse the response
        print(bsObj)
    except OSError:
        print("test")

My objective is to scrape all the available pages and stop the code after that. To do that, I first need to understand what's going on.

Thanks

2 Answers


When you reach the last page, the button gets disabled:

 <a data-pagina="2" href="?ss=4da73052cb8296b5&amp;st=G1&amp;q=incerteza+pol%C3%ADtica+economia&amp;cat=a&amp;species=not%C3%ADcias&amp;page=2"
 class="proximo fundo-cor-produto"> próximo</a>
             ^^^^
             # ok

 <a data-pagina="41" href="?ss=4da73052cb8296b5&amp;st=G1&amp;q=incerteza+pol%C3%ADtica+economia&amp;cat=a&amp;species=not%C3%ADcias&amp;page=41"
     class="proximo disabled">próximo</>
             ^^^^
            # no more next pages

So just keep looping until then:

    from bs4 import BeautifulSoup
    import requests
    from itertools import count

    url = ("http://g1.globo.com/busca/?q=incerteza+pol%C3%ADtica+economia"
           "&cat=a&ss=4da73052cb8296b5&st=G1&species=not%C3%ADcias&page={}")

    page_count = count(1)  # page numbers 1, 2, 3, ...
    soup = BeautifulSoup(requests.get(url.format(next(page_count))).content, "html.parser")
    disabled = soup.select_one("#paginador ul li a.proximo.disabled")
    print([a["href"] for a in soup.select("div.busca-materia-padrao a")])
    while not disabled:
        soup = BeautifulSoup(requests.get(url.format(next(page_count))).content, "html.parser")
        disabled = soup.select_one("#paginador ul li a.proximo.disabled")
        print([a["href"] for a in soup.select("div.busca-materia-padrao a")])

If you were using requests and wanted to check whether you had been redirected, you could access the .history attribute:

    In [1]: import requests

    In [2]: r = requests.get("http://g1.globo.com/busca/?q=incerteza%20pol%C3%ADtica%20economia&cat=a&ss=4da73052cb8296b5&st=G1&species=not%C3%ADcias&page=5000")

    In [3]: print(r.history)
    [<Response [301]>]

    In [4]: r.history[0].status_code == 301
    Out[4]: True
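For example, a minimal sketch that leans on .history to stop at the redirect (assuming in-range pages are served without one, and reusing the url template and selector from above):

    import requests
    from bs4 import BeautifulSoup
    from itertools import count

    url = ("http://g1.globo.com/busca/?q=incerteza+pol%C3%ADtica+economia"
           "&cat=a&ss=4da73052cb8296b5&st=G1&species=not%C3%ADcias&page={}")

    for page in count(1):
        r = requests.get(url.format(page))
        if r.history:  # non-empty history means we were redirected back to page 1
            break
        soup = BeautifulSoup(r.content, "html.parser")
        print([a["href"] for a in soup.select("div.busca-materia-padrao a")])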

Another way using requests would be to disallow redirects and catch a 301 return code.

    from bs4 import BeautifulSoup
    import requests
    from itertools import count

    url = ("http://g1.globo.com/busca/?q=incerteza+pol%C3%ADtica+economia"
           "&cat=a&ss=4da73052cb8296b5&st=G1&species=not%C3%ADcias&page={}")

    page_count = count(1)
    soup = BeautifulSoup(requests.get(url.format(next(page_count))).content, "html.parser")
    print([a["href"] for a in soup.select("div.busca-materia-padrao a")])

    while True:
        r = requests.get(url.format(next(page_count)), allow_redirects=False)
        if r.status_code == 301:  # the redirect itself signals the last page
            break
        soup = BeautifulSoup(r.content, "html.parser")
        print([a["href"] for a in soup.select("div.busca-materia-padrao a")])
  • I think your logic is correct, but the while condition doesn't stop the code when it reaches the last page. Oct 1, 2016 at 11:46
  • @ThalesMarques, yes, I had a typo in my selector, it will work fine now Oct 1, 2016 at 12:55
  • The second snippet keeps looping after the last page, but the last one works fine. I will work on something like it. Thank you very much! Oct 1, 2016 at 16:31

You could always store a hash of the response from the first page (assuming it's actually identical each time) and then check whether the response of each subsequent page matches that hash.
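A minimal sketch of that idea, assuming the redirect response really is byte-for-byte identical to page 1 (any dynamic markup, ads, or timestamps would break the comparison):

    import hashlib
    import requests

    url = ("http://g1.globo.com/busca/?q=incerteza+pol%C3%ADtica+economia"
           "&cat=a&ss=4da73052cb8296b5&st=G1&species=not%C3%ADcias&page={}")

    first_page_hash = hashlib.sha256(requests.get(url.format(1)).content).hexdigest()

    page = 2
    while True:
        content = requests.get(url.format(page)).content
        if hashlib.sha256(content).hexdigest() == first_page_hash:
            break  # same bytes as page 1, so we were redirected back
        # ... process `content` here ...
        page += 1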

Additionally, you could use urllib2:

    # Python 2 (urllib2 was merged into urllib.request in Python 3)
    import urllib2
    from bs4 import BeautifulSoup

    opener = urllib2.build_opener()
    urllib2.install_opener(opener)
    try:
        response = urllib2.urlopen('http://g1.globo.com/busca/?q=incerteza+pol%C3%ADtica+economia&cat=a&ss=4da73052cb8296b5&st=G1&species=not%C3%ADcias&page=5000')
        bsObj = BeautifulSoup(response.read(), "html.parser")  # parse the response
        print(bsObj)
    except urllib2.HTTPError, err:
        if err.code == 404:
            print "Page not found!"
