python: requests-html, a human-friendly HTML parsing library

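requests-html aims to make parsing HTML (for example, scraping the web) as simple and intuitive as possible.
When you use this library, you automatically get:

  • Full JavaScript support!
  • CSS selectors.
  • XPath selectors, for the faint of heart.
  • A mocked user agent (like a real web browser).
  • Automatic following of redirects.
  • Connection pooling and cookie persistence.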

Installation

C:\Users\lifeng>pip install requests-html
Collecting requests-html
  Downloading requests_html-0.10.0-py3-none-any.whl (13 kB)
Collecting fake-useragent
  Downloading fake-useragent-0.1.11.tar.gz (13 kB)
  Preparing metadata (setup.py) ... done
Collecting pyppeteer>=0.0.14
  Downloading pyppeteer-0.2.6-py3-none-any.whl (83 kB)
     |████████████████████████████████| 83 kB 3.4 kB/s
Collecting pyquery
  Downloading pyquery-1.4.3-py3-none-any.whl (22 kB)
Requirement already satisfied: w3lib in d:\python\python37\lib\site-packages (from requests-html) (1.22.0)
Requirement already satisfied: requests in d:\python\python37\lib\site-packages (from requests-html) (2.25.0)
Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Collecting parse
  Downloading parse-1.19.0.tar.gz (30 kB)
  Preparing metadata (setup.py) ... done
Requirement already satisfied: urllib3<2.0.0,>=1.25.8 in d:\python\python37\lib\site-packages (from pyppeteer>=0.0.14->requests-html) (1.26.2)
Requirement already satisfied: appdirs<2.0.0,>=1.4.3 in d:\python\python37\lib\site-packages (from pyppeteer>=0.0.14->requests-html) (1.4.4)
Requirement already satisfied: importlib-metadata>=1.4 in d:\python\python37\lib\site-packages (from pyppeteer>=0.0.14->requests-html) (1.7.0)
Collecting pyee<9.0.0,>=8.1.0
  Downloading pyee-8.2.2-py2.py3-none-any.whl (12 kB)
Collecting websockets<10.0,>=9.1
  Downloading websockets-9.1-cp37-cp37m-win_amd64.whl (90 kB)
     |████████████████████████████████| 90 kB 4.9 kB/s
Requirement already satisfied: tqdm<5.0.0,>=4.42.1 in d:\python\python37\lib\site-packages (from pyppeteer>=0.0.14->requests-html) (4.62.3)
Requirement already satisfied: beautifulsoup4 in d:\python\python37\lib\site-packages (from bs4->requests-html) (4.8.2)
Requirement already satisfied: cssselect>0.7.9 in d:\python\python37\lib\site-packages (from pyquery->requests-html) (1.1.0)
Requirement already satisfied: lxml>=2.1 in d:\python\python37\lib\site-packages (from pyquery->requests-html) (4.5.0)
Requirement already satisfied: certifi>=2017.4.17 in d:\python\python37\lib\site-packages (from requests->requests-html) (2020.4.5.1)
Requirement already satisfied: idna<3,>=2.5 in d:\python\python37\lib\site-packages (from requests->requests-html) (2.9)
Requirement already satisfied: chardet<4,>=3.0.2 in d:\python\python37\lib\site-packages (from requests->requests-html) (3.0.4)
Requirement already satisfied: six>=1.4.1 in d:\python\python37\lib\site-packages (from w3lib->requests-html) (1.12.0)
Requirement already satisfied: zipp>=0.5 in d:\python\python37\lib\site-packages (from importlib-metadata>=1.4->pyppeteer>=0.0.14->requests-html) (3.1.0)
Requirement already satisfied: colorama in d:\python\python37\lib\site-packages (from tqdm<5.0.0,>=4.42.1->pyppeteer>=0.0.14->requests-html) (0.4.3)
Requirement already satisfied: soupsieve>=1.2 in d:\python\python37\lib\site-packages (from beautifulsoup4->bs4->requests-html) (2.0.1)
Using legacy 'setup.py install' for bs4, since package 'wheel' is not installed.
Using legacy 'setup.py install' for fake-useragent, since package 'wheel' is not installed.
Using legacy 'setup.py install' for parse, since package 'wheel' is not installed.
Installing collected packages: websockets, pyee, pyquery, pyppeteer, parse, fake-useragent, bs4, requests-html
    Running setup.py install for parse ... done
    Running setup.py install for fake-useragent ... done
    Running setup.py install for bs4 ... done
Successfully installed bs4-0.0.1 fake-useragent-0.1.11 parse-1.19.0 pyee-8.2.2 pyppeteer-0.2.6 pyquery-1.4.3 requests-html-0.10.0 websockets-9.1

Tutorial and Usage

  • Make a GET request to 'baidu.com' using Requests:
from requests_html import HTMLSession


with HTMLSession() as session:

    r = session.get('https://www.baidu.com/')
    print(r)
  • Grab a list of all links on the page, as-is:
print(r.html.links)

# Output
{'https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E4%B8%8D%E5%B0%91%E5%9C%B0%E5%8C%BA%E7%BB%BF%E5%8F%B6%E8%8F%9C%E4%BB%B7%E6%A0%BC%E5%BC%80%E5%A7%8B%E6%98%8E%E6%98%BE%E5%9B%9E%E8%90%BD&rsv_idx=2&rsv_dl=fyb_n_homepage&hisfilter=1', 'http://xueshu.baidu.com', 'https://b2b.baidu.com/s?fr=wwwt', 'https://baike.baidu.com', '/', 'https://map.baidu.com/?newmap=1&ie=utf-8&s=s', 'https://top.baidu.com/board?platform=pc&sa=pcindex_entry', 'http://tieba.baidu.com/f?fr=wwwt', 'http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&fr=wwwt', 'https://wenku.baidu.com', 'http://news.baidu.com', 'https://beian.miit.gov.cn', '//www.baidu.com/duty', 'http://tieba.baidu.com', 'https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E5%8C%97%E6%96%B9%E6%9A%B4%E9%9B%AA%E5%8D%B3%E5%B0%86%E4%B8%8A%E7%BA%BF&rsv_idx=2&rsv_dl=fyb_n_homepage&hisfilter=1', '//home.baidu.com', '//www.baidu.com/licence/', 'https://jingyan.baidu.com', 'https://live.baidu.com/', 'http://e.baidu.com/ebaidu/home?refer=887', 'http://map.baidu.com', 'https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E5%8D%97%E6%96%B9%E5%91%A8%E6%9C%AB%E5%88%9B%E5%A7%8B%E4%BA%BA%E5%B7%A6%E6%96%B9%E5%8E%BB%E4%B8%96&rsv_idx=2&rsv_dl=fyb_n_homepage&hisfilter=1', 'http://wenku.baidu.com/search?lm=0&od=0&ie=utf-8', 'https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E6%B2%B3%E5%8C%97%E7%96%AB%E6%83%85%E5%AD%98%E5%A4%9A%E6%9D%A1%E4%BC%A0%E6%92%AD%E9%93%BE+%E6%B6%89%E5%A9%9A%E5%AE%B4%E7%AD%89&rsv_idx=2&rsv_dl=fyb_n_homepage&hisfilter=1', '//help.baidu.com', 'http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8', 'https://zhidao.baidu.com', 'https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E5%8C%97%E4%BA%AC%E6%97%A0%E5%8D%B0%E8%89%AF%E5%93%81%E8%B5%B7%E8%AF%89%E6%97%A5%E6%9C%AC%E6%97%A0%E5%8D%B0%E8%89%AF%E5%93%81%E8%8E%B7%E8%83%9C&rsv_idx=2&rsv_dl=fyb_n_homepage&hisfilter=1', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F&sms=5', 'http://image.baidu.com', 
'http://www.baidu.com/more/', 'http://ir.baidu.com', 'https://www.hao123.com', 'http://music.taihe.com', 'https://haokan.baidu.com/?sfrom=baidu-top', 'https://pan.baidu.com', 'http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11000002000001', 'https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news', 'https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E4%B8%AD%E4%BC%81%E4%BA%A7%E5%93%81%E8%A2%AB%E7%BE%8E%E6%96%B9%E6%89%A3%E7%95%99+%E5%A4%96%E4%BA%A4%E9%83%A8%E5%9B%9E%E5%BA%94&rsv_idx=2&rsv_dl=fyb_n_homepage&hisfilter=1', 'http://image.baidu.com/i?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8'}

  • Grab a list of all links on the page, in absolute form:
print(r.html.absolute_links)

# Output
{'http://map.baidu.com', 'https://beian.miit.gov.cn', 'https://www.baidu.com/duty', 'http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8', 'https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news', 'https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E5%8D%97%E6%96%B9%E5%91%A8%E6%9C%AB%E5%88%9B%E5%A7%8B%E4%BA%BA%E5%B7%A6%E6%96%B9%E5%8E%BB%E4%B8%96&rsv_idx=2&rsv_dl=fyb_n_homepage&hisfilter=1', 'https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E4%B8%8D%E5%B0%91%E5%9C%B0%E5%8C%BA%E7%BB%BF%E5%8F%B6%E8%8F%9C%E4%BB%B7%E6%A0%BC%E5%BC%80%E5%A7%8B%E6%98%8E%E6%98%BE%E5%9B%9E%E8%90%BD&rsv_idx=2&rsv_dl=fyb_n_homepage&hisfilter=1', 'https://map.baidu.com/?newmap=1&ie=utf-8&s=s', 'http://xueshu.baidu.com', 'http://image.baidu.com/i?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8', 'https://top.baidu.com/board?platform=pc&sa=pcindex_entry', 'https://haokan.baidu.com/?sfrom=baidu-top', 'https://help.baidu.com', 'https://www.hao123.com', 'https://pan.baidu.com', 'https://zhidao.baidu.com', 'https://wenku.baidu.com', 'https://home.baidu.com', 'https://jingyan.baidu.com', 'https://baike.baidu.com', 'http://ir.baidu.com', 'http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&fr=wwwt', 'https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E5%8C%97%E4%BA%AC%E6%97%A0%E5%8D%B0%E8%89%AF%E5%93%81%E8%B5%B7%E8%AF%89%E6%97%A5%E6%9C%AC%E6%97%A0%E5%8D%B0%E8%89%AF%E5%93%81%E8%8E%B7%E8%83%9C&rsv_idx=2&rsv_dl=fyb_n_homepage&hisfilter=1', 'http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11000002000001', 'http://image.baidu.com', 'https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E4%B8%AD%E4%BC%81%E4%BA%A7%E5%93%81%E8%A2%AB%E7%BE%8E%E6%96%B9%E6%89%A3%E7%95%99+%E5%A4%96%E4%BA%A4%E9%83%A8%E5%9B%9E%E5%BA%94&rsv_idx=2&rsv_dl=fyb_n_homepage&hisfilter=1', 'https://www.baidu.com/licence/', 'http://news.baidu.com', 'http://music.taihe.com', 'http://www.baidu.com/more/', 'https://www.baidu.com/', 
'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F&sms=5', 'https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E5%8C%97%E6%96%B9%E6%9A%B4%E9%9B%AA%E5%8D%B3%E5%B0%86%E4%B8%8A%E7%BA%BF&rsv_idx=2&rsv_dl=fyb_n_homepage&hisfilter=1', 'http://tieba.baidu.com', 'http://e.baidu.com/ebaidu/home?refer=887', 'http://tieba.baidu.com/f?fr=wwwt', 'https://b2b.baidu.com/s?fr=wwwt', 'https://live.baidu.com/', 'https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E6%B2%B3%E5%8C%97%E7%96%AB%E6%83%85%E5%AD%98%E5%A4%9A%E6%9D%A1%E4%BC%A0%E6%92%AD%E9%93%BE+%E6%B6%89%E5%A9%9A%E5%AE%B4%E7%AD%89&rsv_idx=2&rsv_dl=fyb_n_homepage&hisfilter=1', 'http://wenku.baidu.com/search?lm=0&od=0&ie=utf-8'}
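
Comparing the two outputs, entries such as '/' and '//home.baidu.com' have been resolved against the page URL. The sketch below reproduces that resolution with the standard library's urllib.parse.urljoin. It only illustrates the idea and is not necessarily how requests-html does it internally; the `raw_links` list is a hand-picked sample from the output above.

```python
from urllib.parse import urljoin

# URL of the page the links were scraped from.
base_url = "https://www.baidu.com/"

# A few raw hrefs copied from the `links` output above (hand-picked sample).
raw_links = ["/", "//home.baidu.com", "//www.baidu.com/duty", "http://news.baidu.com"]

# Resolve every href against the base URL; already-absolute URLs are kept as they are.
absolute = {urljoin(base_url, href) for href in raw_links}

for url in sorted(absolute):
    print(url)
```

Scheme-relative links (starting with '//') inherit the scheme of the base URL, which is why '//home.baidu.com' shows up as 'https://home.baidu.com' in the absolute output.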

  • Select an element with a CSS selector:
print(r.html.find("#kw", first=True))

# Output
<Element 'input' id='kw' name='wd' class=('s_ipt',) value='' maxlength='255' autocomplete='off'>

  • Get an element's text content:
data = r.html.find(".text-color", first=True)
print(data.text)

# Output
关于百度

  • Get an element's attributes:
data = r.html.find(".text-color", first=True)
print(data.attrs)

# Output
{'class': ('text-color',), 'href': '//home.baidu.com', 'target': '_blank'}

  • Render an element's HTML:
data = r.html.find(".text-color", first=True)
print(data.html)

# Output
<a class="text-color" href="//home.baidu.com" target="_blank">关于百度</a>

  • Select a list of elements within an element:
data = r.html.find(".text-color", first=True)
print(data.find('a'))

# Output
[<Element 'a' class=('text-color',) href='//home.baidu.com' target='_blank'>]

  • Search for links within an element:
data = r.html.find(".text-color", first=True)
print(data.absolute_links)

# Output
{'https://home.baidu.com'}

  • Search for text on the page using a template:
print(r.html.search("baidu"))

# Output
<Result () {}>
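
The result above is empty because search() expects a parse-style template, and "baidu" contains no `{}` placeholder, so the text matches but nothing is captured. With the library, a template such as 'Python is a {} language' would expose the captured word as the first item of the result. The sketch below approximates that behaviour with the standard re module; `template_to_regex` is a hypothetical helper written for illustration, not part of requests-html, and it ignores named or typed placeholders.

```python
import re


def template_to_regex(template):
    """Turn a template with `{}` placeholders into a regex with one group per placeholder."""
    literal_parts = template.split("{}")
    # Escape the literal text and join the pieces with non-greedy capture groups.
    return re.compile("(.+?)".join(re.escape(part) for part in literal_parts))


text = "Python is a programming language"

# With a placeholder, the text between the literal parts is captured.
with_placeholder = template_to_regex("Python is a {} language").search(text)
print(with_placeholder.groups())

# Without a placeholder the template still matches, but there is nothing to capture.
without_placeholder = template_to_regex("Python").search(text)
print(without_placeholder.groups())
```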

  • A more complex CSS selector example (copied from Chrome DevTools):
from requests_html import HTMLSession


with HTMLSession() as session:

    r = session.get('https://www.baidu.com/')

    ele = "li.hotsearch-item:nth-child(1) > a:nth-child(1) > span:nth-child(2)"
    print(r.html.find(ele, first=True).text)

# Output
河北疫情存多条传播链 涉婚宴等

  • XPath is also supported:
from requests_html import HTMLSession


with HTMLSession() as session:

    r = session.get('https://www.baidu.com/')
    print(r.html.xpath('//*[@id="kw"]'))

# Output
[<Element 'input' id='kw' name='wd' class=('s_ipt',) value='' maxlength='255' autocomplete='off'>]
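
To make the expression itself easier to read, here is a small offline sketch using the standard library's xml.etree.ElementTree, which understands a limited XPath subset. The markup is a hypothetical, well-formed stand-in for the search form, and ElementTree is swapped in only for illustration; requests-html relies on lxml, which supports full XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the search form on the page.
doc = """
<form>
  <input id="kw" name="wd" class="s_ipt"/>
  <input id="su" type="submit"/>
</form>
"""

root = ET.fromstring(doc)

# './/*[@id="kw"]' reads: any descendant element whose id attribute equals "kw".
matches = root.findall('.//*[@id="kw"]')

for element in matches:
    print(element.tag, element.attrib)
```

As in the output above, the query returns a list, even when only one element matches.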

  • You can also select only elements containing specific text:
from requests_html import HTMLSession


with HTMLSession() as session:

    r = session.get('https://www.baidu.com/')
    print(r.html.find('a', containing='baidu'))

# Output
[<Element 'a' class=('text-color',) href='http://ir.baidu.com' target='_blank'>]

JavaScript Support

It can also scrape JavaScript-rendered text:

from requests_html import HTMLSession


with HTMLSession() as session:

    r = session.get('http://www.baidu.com/')
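    # render() returns None unless a script is passed; the rendered markup ends up in r.html.html.
    # The first call downloads Chromium into your home directory (e.g. ~/.pyppeteer/).
    r.html.render()
    print(r.html.html)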

Pagination

from requests_html import HTMLSession


with HTMLSession() as session:

    r = session.get('http://news.baidu.com/')

    for html in r.html:
        print(html)

Or you can simply request the next URL:

from requests_html import HTMLSession


with HTMLSession() as session:

    r = session.get('http://news.baidu.com/')

    print(r.html.next())
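
Conceptually, iterating over a page keeps following "next" links until none is left. The sketch below shows that loop on an in-memory stand-in for a site. Everything in it (`FAKE_SITE`, `fetch`, `iter_pages`) is hypothetical and written for illustration; it does not reproduce how requests-html detects the next link.

```python
# Hypothetical site: each URL maps to (page text, URL of the next page or None).
FAKE_SITE = {
    "/news?page=1": ("first page", "/news?page=2"),
    "/news?page=2": ("second page", "/news?page=3"),
    "/news?page=3": ("last page", None),
}


def fetch(url):
    """Stand-in for an HTTP request; returns (text, next_url)."""
    return FAKE_SITE[url]


def iter_pages(start_url):
    """Yield page texts, following 'next' links until there are none left."""
    url = start_url
    seen = set()
    while url is not None and url not in seen:
        seen.add(url)  # guard against pagination loops
        text, url = fetch(url)
        yield text


pages = list(iter_pages("/news?page=1"))
print(pages)
```

The `seen` set is a design choice of this sketch: real sites sometimes link back to an earlier page, and without the guard the loop would never end.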

Using without Requests

You can also use this library without Requests:

from requests_html import HTML


doc = """<a href='https://httpbin.org'>"""
html = HTML(html=doc)
print(html.links)

# Output
{'https://httpbin.org'}
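
For intuition about what collecting links from a raw HTML string involves, here is a rough standard-library equivalent built on html.parser. `LinkCollector` is a hypothetical class written for this illustration; requests-html itself builds on lxml and pyquery rather than this parser.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)


doc = """<a href='https://httpbin.org'>"""
collector = LinkCollector()
collector.feed(doc)
print(collector.links)
```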

You can also render JavaScript pages without Requests:

from requests_html import HTML


script = """
        () => {
            return {
                width: document.documentElement.clientWidth,
                height: document.documentElement.clientHeight,
                deviceScaleFactor: window.devicePixelRatio,
            }
        }
    """
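# The JavaScript source doubles as the document body here, which is why it
# appears as escaped text in html.html below; any HTML string would do.
html = HTML(html=script)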
val = html.render(script=script, reload=False)
print(val)

print(html.html)

# Output
{'width': 800, 'height': 600, 'deviceScaleFactor': 1}
<html><head></head><body>() =&gt; {
            return {
                width: document.documentElement.clientWidth,
                height: document.documentElement.clientHeight,
                deviceScaleFactor: window.devicePixelRatio,
            }
        }</body></html>

Accessing Websites Asynchronously

  • Use async to fetch several sites at the same time:
from requests_html import AsyncHTMLSession


asession = AsyncHTMLSession()
async def get_pythonorg():
    r = await asession.get('https://python.org/')
    return r

async def get_douban():
    r = await asession.get('https://www.douban.com/')
    return r

async def get_baidu():
    r = await asession.get('https://www.baidu.com/')
    return r


# run() takes the coroutine functions themselves, not already-called coroutines.
results = asession.run(get_pythonorg, get_douban, get_baidu)
print(results)

# Output
[<Response [200]>, <Response [200]>, <Response [200]>]

  • Each item in the result list is a response object that you can interact with:
from requests_html import AsyncHTMLSession


asession = AsyncHTMLSession()
async def get_pythonorg():
    r = await asession.get('https://python.org/')
    return r

async def get_douban():
    r = await asession.get('https://www.douban.com/')
    return r

async def get_baidu():
    r = await asession.get('https://www.baidu.com/')
    return r


# run() takes the coroutine functions themselves, not already-called coroutines.
results = asession.run(get_pythonorg, get_douban, get_baidu)

for result in results:
    print(result.html.url)

# Output
https://www.python.org/
https://www.baidu.com/
https://www.douban.com/
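
Note that the URLs above are not printed in the order the coroutines were passed in, which suggests that results should be identified by their URL rather than by position. The sketch below uses plain asyncio to show the same concurrent pattern offline; `fake_fetch` is a hypothetical stand-in for a network request, and asyncio.gather is used here because it returns results in call order regardless of which task finishes first.

```python
import asyncio


async def fake_fetch(url, delay):
    """Hypothetical stand-in for a network request that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return url


async def main():
    # All three "requests" run concurrently; the slowest one is listed first.
    return await asyncio.gather(
        fake_fetch("https://python.org/", 0.03),
        fake_fetch("https://www.douban.com/", 0.02),
        fake_fetch("https://www.baidu.com/", 0.01),
    )


results = asyncio.run(main())
print(results)
```

Even though the last coroutine finishes first, gather keeps the results aligned with the order of its arguments.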

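This summary may or may not help you, but I hope it does. If anything is unclear or ambiguous, send me a private message or leave a comment and I will correct it promptly. Likes and shares are much appreciated, thank you!

To be continued…

I keep working hard, and I hope you do too!

Search WeChat for the official account: 就用python

Edited on 2021-11-04 · Copyright belongs to the author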