For beginners, figuring out the knowledge system of a field is often more important than learning any single technology: technologies change quickly with the times, while the knowledge system changes far less. Today we look at the knowledge system of Python crawlers from a self-learner's perspective.
1. The basic steps for a Python crawler to extract information
- Get the data
- Parse the data
- Extract the data
- Save the data
2. Python crawler learning framework
- 1. requests library
The main job of the requests library is to simulate a browser sending requests to fetch web data. The most important method is requests.get(); next come three important attributes of the response object it returns: .text (the decoded text of the page), .content (the raw binary content), and .encoding (read or change the page's encoding).
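As a quick illustration, here is a minimal sketch of that method and those attributes (the URL is a placeholder):

```python
import requests

# a minimal sketch: fetch a page and inspect the response object
r = requests.get("https://www.example.com")
print(r.status_code)   # HTTP status code; 200 means success
print(r.encoding)      # the encoding requests guessed for the page
r.encoding = "utf-8"   # change the encoding if the guess is wrong
print(r.text[:200])    # the first 200 characters of the decoded text
print(len(r.content))  # size of the raw binary content in bytes
```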
- 2. beautifulsoup library
The main job of the beautifulsoup library is parsing web pages and extracting information. In practice we use it less often in daily work, because it is not convenient enough! You heard that right: although beautifulsoup can handle most web-page information extraction, I still recommend XPath (used together with Google Chrome's XPath plug-in). There are also the famous regular expressions (re), but they are used less frequently.
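For comparison, here is a minimal BeautifulSoup sketch (the URL is a placeholder):

```python
import requests
from bs4 import BeautifulSoup

# a minimal sketch: parse a page and pull out every link
r = requests.get("https://www.example.com")
soup = BeautifulSoup(r.text, "html.parser")   # parse with the built-in parser
for link in soup.find_all("a"):               # find every <a> tag
    print(link.get_text(), link.get("href"))  # its text and href attribute
```

The full example below uses the approach I actually recommend, requests plus lxml's XPath, downloading meme images from fabiaoqing.com: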
```python
import requests, time  # requests for HTTP, time for polite delays
from lxml import etree  # load the parser needed for XPath

for page in range(1, 2):  # crawl pages (here just page 1)
    url = f"https://www.fabiaoqing.com/biaoqing/lists/page/{page}.html"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"
    }  # URL and request header
    # the f-string above is equivalent to
    # "https://www.fabiaoqing.com/biaoqing/lists/page/{}.html".format(page)
    r = requests.get(url, headers=headers)  # send the network request
    time.sleep(1)  # wait 1 second to reduce the load on the web server
    print(r.status_code)  # print the status code
    html = etree.HTML(r.text)  # put the text of the page into the parser
    results = html.xpath("//div[@class='tagbqppdiv']/a/img")  # locate the image tags
    for result in results:  # iterate through the matches
        tpurl = result.xpath("./@data-original")[0]  # extract the image URL
        print(tpurl)  # print the URL
        tp = requests.get(tpurl, headers=headers)  # download the image
        with open(tpurl[-10:], "wb") as f:  # name the file after the last 10 characters of the URL
            f.write(tp.content)  # write the binary content; the with block closes the file
```
Learn these first two libraries plus the XPath selector, and congratulations: you have mastered at least 60% of web crawling.
- 3. extracting asynchronously loaded data (Ajax)
Examples include NetEase Cloud Music, QQ Music, and so on. There are two cases to distinguish here: asynchronous loading and algorithmic encryption. For asynchronous loading, you just need to find the real URL of the request through the browser's XHR panel and extract the JSON data (handled the same way as a dictionary); for algorithmically encrypted data, use the selenium library directly and spare yourself the hair loss.
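Here is a minimal sketch of the asynchronous-loading case, assuming we have already found the real JSON API URL in the browser's Network > XHR panel (the URL and JSON keys below are placeholders):

```python
import requests

# a minimal sketch: request the real API URL found in the XHR panel
api_url = "https://example.com/api/songs?page=1"  # placeholder endpoint
headers = {"User-Agent": "Mozilla/5.0"}

r = requests.get(api_url, headers=headers)
data = r.json()                 # parse the JSON body into Python objects
for item in data["songs"]:      # hypothetical key; depends on the actual API
    print(item["name"])         # extract fields just like a dictionary
```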
- 4. requesting paged data with URL parameters (handling pagination)
When dealing with asynchronous loading, you usually also have to handle URL pagination. Inspect a few consecutive request URLs, work out the pagination rule, and then send requests with the right URL parameters.
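A minimal sketch of pagination via URL parameters (the endpoint and parameter names are placeholders):

```python
import requests, time

# a minimal sketch: let requests build each paged URL from parameters
headers = {"User-Agent": "Mozilla/5.0"}
for page in range(1, 4):                 # compare a few pages to confirm the rule
    params = {"page": page, "size": 20}  # hypothetical parameter names
    r = requests.get("https://example.com/api/list",
                     params=params, headers=headers)
    print(r.url, r.status_code)          # the URL requests actually sent
    time.sleep(1)                        # be polite to the server
```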
Learn the first four steps, and congratulations: you have mastered at least 80% of web crawling.
- 5. cookies and session (handling login and comments)
Some sites keep data behind a login, such as Taobao and Ctrip. For these you need Python to simulate logging in; of course, you can also simulate posting comments the same way.
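A minimal sketch of a simulated login with requests.Session (the URL and form field names are placeholders; real sites such as Taobao add captchas and encryption on top):

```python
import requests

# a minimal sketch: a Session keeps cookies across requests
session = requests.Session()
login_data = {"username": "user", "password": "pass"}  # hypothetical form fields
session.post("https://example.com/login", data=login_data)
r = session.get("https://example.com/my/orders")  # carries the login cookies
print(r.status_code)
```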
- 6. selenium library
The selenium library works by commanding a real browser and reading the page information from it indirectly, so it can ignore page encryption and handle logins quickly. The advantages are obvious and I highly recommend it, but you do need to configure a browser driver.
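A minimal selenium sketch using the Selenium 4 API (assumes a matching driver such as chromedriver is installed and on your PATH; the URL is a placeholder):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# a minimal sketch: drive a real browser and read the rendered page
driver = webdriver.Chrome()                       # needs chromedriver configured
driver.get("https://www.example.com")
print(driver.title)                               # title of the rendered page
element = driver.find_element(By.TAG_NAME, "h1")  # locate an element in the live DOM
print(element.text)
driver.quit()                                     # always close the browser
```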
Learn the first six steps, and congratulations: you have mastered at least 90% of web crawling.
- 7. gevent library and queue module (multi-coroutine crawling and queues)
If you need to fetch data at scale, learn these two modules; non-professionals can skip them directly.
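A minimal sketch of concurrent crawling with gevent coroutines sharing a queue (the URLs are placeholders):

```python
from gevent import monkey
monkey.patch_all()  # patch blocking I/O first so coroutines can switch during network waits
import gevent, requests
from gevent.queue import Queue

work = Queue()
for page in range(1, 5):  # put placeholder URLs on the queue
    work.put_nowait(f"https://example.com/page/{page}")

def crawler():
    while not work.empty():
        url = work.get_nowait()  # take the next URL off the queue
        r = requests.get(url)
        print(url, r.status_code)

tasks = [gevent.spawn(crawler) for _ in range(2)]  # two coroutines share the queue
gevent.joinall(tasks)                              # wait for all of them to finish
```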
- 8. scrapy library
scrapy is an excellent web crawling framework; you have probably heard of it! It is suited to large-scale data extraction, and self-learners can give it a try.
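A minimal sketch of a Scrapy spider (the domain and selectors are placeholders):

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com/page/1"]  # placeholder start page

    def parse(self, response):
        for title in response.xpath("//h2/text()").getall():
            yield {"title": title}  # yielded items can be exported automatically
        next_page = response.css("a.next::attr(href)").get()  # hypothetical selector
        if next_page:
            yield response.follow(next_page, self.parse)  # follow pagination
```

Save this as a file such as example_spider.py and run it with `scrapy runspider example_spider.py -o titles.json`; Scrapy handles scheduling, request deduplication, and export for you.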
Learn all eight steps, and congratulations: you have mastered large-scale data crawling.