Mitchell Ryan - Web Scraping With Python, 3rd Edition [2024, PDF, ENG] :: RuTracker.org

tsurijin · 16-[month unreadable]-24 09:35

Web Scraping With Python, 3rd Edition
Year: 2024
Author: Mitchell Ryan
Publisher: O'Reilly Media, Inc.
ISBN: 978-1-098-14535-4
Language: English
Format: PDF
Quality: eBook
Pages: 352
Description: If programming is magic, then web scraping is surely a form of wizardry. By writing a simple automated program, you can query web servers, request data, and parse it to extract the information you need. This thoroughly updated third edition not only introduces you to web scraping but also serves as a comprehensive guide to scraping almost every type of data from the modern web.
Part I focuses on web scraping mechanics: using Python to request information from a web server, performing basic handling of the server's response, and interacting with sites in an automated fashion. Part II explores a variety of more specific tools and applications to fit any web scraping scenario you're likely to encounter.
Parse complicated HTML pages
Develop crawlers with the Scrapy framework
Learn methods to store the data you scrape
Read and extract data from documents
Clean and normalize badly formatted data
Read and write natural languages
Crawl through forms and logins
Scrape JavaScript and crawl through APIs
Use and write image-to-text software
Avoid scraping traps and bot blockers
Use scrapers to test your website

Table of Contents

Preface ix
Part I. Building Scrapers
1. How the Internet Works 3
Networking 4
Physical Layer 5
Data Link Layer 5
Network Layer 6
Transport Layer 6
Session Layer 7
Presentation Layer 7
Application Layer 7
HTML 7
CSS 9
JavaScript 11
Watching Websites with Developer Tools 13
2. The Legalities and Ethics of Web Scraping 17
Trademarks, Copyrights, Patents, Oh My! 17
Copyright Law 18
Trespass to Chattels 21
The Computer Fraud and Abuse Act 23
robots.txt and Terms of Service 24
Three Web Scrapers 28
eBay v. Bidder's Edge and Trespass to Chattels 28
United States v. Auernheimer and the Computer Fraud and Abuse Act 29
Field v. Google: Copyright and robots.txt 31
3. Applications of Web Scraping 33
Classifying Projects 33
E-commerce 34
Marketing 35
Academic Research 36
Product Building 37
Travel 38
Sales 39
SERP Scraping 40
4. Writing Your First Web Scraper 41
Installing and Using Jupyter 41
Connecting 43
An Introduction to BeautifulSoup 44
Installing BeautifulSoup 44
Running BeautifulSoup 46
Connecting Reliably and Handling Exceptions 49
5. Advanced HTML Parsing 53
Another Serving of BeautifulSoup 53
find() and find_all() with BeautifulSoup 55
Other BeautifulSoup Objects 57
Navigating Trees 58
Regular Expressions 62
Regular Expressions and BeautifulSoup 66
Accessing Attributes 67
Lambda Expressions 68
You Don't Always Need a Hammer 69
6. Writing Web Crawlers 71
Traversing a Single Domain 71
Crawling an Entire Site 75
Collecting Data Across an Entire Site 78
Crawling Across the Internet 81
7. Web Crawling Models 87
Planning and Defining Objects 88
Dealing with Different Website Layouts 91
Structuring Crawlers 96
Crawling Sites Through Search 96
Crawling Sites Through Links 99
Crawling Multiple Page Types 101
Thinking About Web Crawler Models 103
8. Scrapy 105
Installing Scrapy 105
Initializing a New Spider 106
Writing a Simple Scraper 107
Spidering with Rules 108
Creating Items 113
Outputting Items 115
The Item Pipeline 116
Logging with Scrapy 119
More Resources 119
9. Storing Data 121
Media Files 121
Storing Data to CSV 124
MySQL 126
Installing MySQL 127
Some Basic Commands 129
Integrating with Python 132
Database Techniques and Good Practice 135
"Six Degrees" in MySQL 137
Email 140
Part II. Advanced Scraping
10. Reading Documents 145
Document Encoding 145
Text 146
Text Encoding and the Global Internet 147
CSV 151
Reading CSV Files 151
PDF 153
Microsoft Word and .docx 155
11. Working with Dirty Data 159
Cleaning Text 160
Working with Normalized Text 164
Cleaning Data with Pandas 166
Cleaning 168
Indexing, Sorting, and Filtering 171
More About Pandas 172
12. Reading and Writing Natural Languages 173
Summarizing Data 174
Markov Models 178
Six Degrees of Wikipedia: Conclusion 181
Natural Language Toolkit 184
Installation and Setup 184
Statistical Analysis with NLTK 185
Lexicographical Analysis with NLTK 188
Additional Resources 191
13. Crawling Through Forms and Logins 193
Python Requests Library 193
Submitting a Basic Form 194
Radio Buttons, Checkboxes, and Other Inputs 197
Submitting Files and Images 198
Handling Logins and Cookies 199
HTTP Basic Access Authentication 200
Other Form Problems 202
14. Scraping JavaScript 203
A Brief Introduction to JavaScript 204
Common JavaScript Libraries 205
Ajax and Dynamic HTML 208
Executing JavaScript in Python with Selenium 209
Installing and Running Selenium 209
Selenium Selectors 212
Waiting to Load 213
XPath 215
Additional Selenium WebDrivers 216
Handling Redirects 216
A Final Note on JavaScript 218
15. Crawling Through APIs 221
A Brief Introduction to APIs 221
HTTP Methods and APIs 223
More About API Responses 224
Parsing JSON 226
Undocumented APIs 227
Finding Undocumented APIs 228
Documenting Undocumented APIs 230
Combining APIs with Other Data Sources 230
More About APIs 234
16. Image Processing and Text Recognition 235
Overview of Libraries 236
Pillow 236
Tesseract 237
NumPy 239
Processing Well-Formatted Text 239
Adjusting Images Automatically 242
Scraping Text from Images on Websites 245
Reading CAPTCHAs and Training Tesseract 248
Training Tesseract 249
Retrieving CAPTCHAs and Submitting Solutions 256
17. Avoiding Scraping Traps 259
A Note on Ethics 259
Looking Like a Human 260
Adjust Your Headers 261
Handling Cookies with JavaScript 262
TLS Fingerprinting 264
Timing Is Everything 267
Common Form Security Features 267
Hidden Input Field Values 268
Avoiding Honeypots 269
The Human Checklist 271
18. Testing Your Website with Scrapers 273
An Introduction to Testing 274
What Are Unit Tests? 274
Python unittest 275
Testing Wikipedia 277
Testing with Selenium 279
Interacting with the Site 280
19. Web Scraping in Parallel 285
Processes Versus Threads 285
Multithreaded Crawling 286
Race Conditions and Queues 289
More Features of the Threading Module 292
Multiple Processes 294
Multiprocess Crawling 296
Communicating Between Processes 297
Multiprocess Crawling: Another Approach 299
20. Web Scraping Proxies 301
Why Use Remote Servers? 301
Avoiding IP Address Blocking 302
Portability and Extensibility 303
Tor 303
PySocks 305
Remote Hosting 306
Running from a Website-Hosting Account 306
Running from the Cloud 307
Moving Forward 308
Web Scraping Proxies 309
ScrapingBee 310
ScraperAPI 312
Oxylabs 314
Zyte 318
Additional Resources 321
Index 323

Mitchell R. - Web Scraping with Python [2024, EPUB, ENG]

Download size: 10 MB
