Python Web Scraping: From Basics to Practice

Published 2024-01-15 · about a 15-minute read

Tags: Python, web scraping, data analysis

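Contents

1. Introduction: Why Learn Web Scraping?
2. Setting Up the Environment
3. HTTP Basics
4. Using the requests Library
5. Parsing HTML with BeautifulSoup
6. Hands-On Examples
7. Common Problems and Anti-Scraping Measures
8. Summary and Next Steps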

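1. Introduction: Why Learn Web Scraping?

Data drives much of today's work, and web scraping is one of the core skills for collecting it. Market research, competitor analysis, academic research, and machine learning all depend on large amounts of data.

Why learn scraping: Collecting data is a basic skill for a data analyst. Once you can scrape, you no longer depend on data supplied by others and can gather what you need yourself.

With its concise syntax and rich set of third-party libraries, Python is a popular choice for writing scrapers. This article starts from the basics and walks you step by step through the core skills of web scraping in Python.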

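2. Setting Up the Environment

First, make sure Python is installed (Python 3.6 or later, since the examples use f-strings). Then install the required libraries with pip:

# Install the scraping libraries
pip install requests beautifulsoup4 lxml

# Verify the installation
python -c "import requests; print('requests OK')"
python -c "import bs4; print('beautifulsoup4 OK')"
python -c "import lxml; print('lxml OK')"

What each library does

requests - sends HTTP requests
beautifulsoup4 - parses HTML/XML
lxml - a high-performance HTML parser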
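The same check can also be done from inside Python. The sketch below uses the standard-library importlib.metadata module (available from Python 3.8) to report which of the three packages are installed. The helper name installed_versions is only illustrative and not part of any library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


# Report the status of the three libraries used in this article.
for name, ver in installed_versions(["requests", "beautifulsoup4", "lxml"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

Any package reported as "not installed" can be added with the pip command shown earlier.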

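3. HTTP Basics

Before writing any code, we need a few basic HTTP concepts:

GET request - retrieves data from the server
POST request - submits data to the server
Headers - metadata sent with each request; scrapers often set them to resemble a browser
Status code - the result of the request (200 means success, 404 means not found, and so on)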
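To make these four concepts concrete, here is a minimal sketch that uses only the standard library and sends nothing over the network. It builds a GET and a POST request object, attaches a header, and looks up the names of two status codes. The URLs are the placeholder addresses used elsewhere in this article, and the client name demo-client/0.1 is made up for illustration.

```python
from http import HTTPStatus
from urllib.parse import urlencode
from urllib.request import Request

# GET: parameters travel in the URL's query string.
get_request = Request("https://api.example.com/items?" + urlencode({"page": 1, "limit": 10}))

# POST: data travels in the request body (as bytes), not in the URL.
post_request = Request(
    "https://www.example.com/login",
    data=urlencode({"username": "admin"}).encode("utf-8"),
    headers={"User-Agent": "demo-client/0.1"},  # hypothetical client name
)

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
print(post_request.header_items())

# Status codes arrive with the response; the standard library knows their names.
for code in (200, 404):
    print(code, HTTPStatus(code).phrase)
```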

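4. Using the requests Library

requests is one of the most popular HTTP libraries for Python, with a concise and readable API:

import requests

# Basic GET request (the timeout keeps the call from hanging indefinitely)
response = requests.get('https://www.example.com', timeout=10)
print(f'Status code: {response.status_code}')
print(f'Encoding: {response.encoding}')
print(f'Content length: {len(response.text)} characters')

# GET request with query parameters
params = {'page': 1, 'limit': 10}
response = requests.get('https://api.example.com/items', params=params, timeout=10)

# POST request with form data
data = {'username': 'admin', 'password': '123456'}
response = requests.post('https://www.example.com/login', data=data, timeout=10)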
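It can help to see what requests actually does with params and data. The sketch below builds the same two requests but only prepares them instead of sending them, so it runs without network access. It reuses the placeholder URLs and values from the example above.

```python
import requests

# Build the requests without sending them, so nothing touches the network.
get_prepared = requests.Request(
    "GET", "https://api.example.com/items", params={"page": 1, "limit": 10}
).prepare()
post_prepared = requests.Request(
    "POST", "https://www.example.com/login", data={"username": "admin", "password": "123456"}
).prepare()

print(get_prepared.url)    # params are appended to the URL as a query string
print(post_prepared.body)  # form data is URL-encoded into the request body
print(post_prepared.headers["Content-Type"])
```

The GET parameters end up in the URL, while the POST form fields end up in the body with a form-encoded Content-Type header.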

4.1 Imitating a Browser

Many sites inspect the User-Agent request header to block scrapers, so we need to send suitable request headers:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Connection': 'keep-alive',
}

response = requests.get('https://www.example.com', headers=headers, timeout=10)
print(response.status_code)
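When a script makes several requests, passing headers to every call gets repetitive. One option, sketched below, is requests.Session, which stores the headers once and merges them into each request. The sketch only prepares a request rather than sending it, so it runs offline. The shortened User-Agent string is just an example value.

```python
import requests

session = requests.Session()
# Headers set on the session are merged into every request it sends.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
})

# prepare_request() applies the session settings without sending anything.
prepared = session.prepare_request(requests.Request("GET", "https://www.example.com"))
print(prepared.headers["User-Agent"])
print(prepared.headers["Accept-Language"])

# A real call would look like: session.get("https://www.example.com", timeout=10)
```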

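5. Parsing HTML with BeautifulSoup

Once we have the page content, we need to parse the HTML to extract the data we want. BeautifulSoup is one of the most widely used HTML parsing libraries: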

from bs4 import BeautifulSoup

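html = '''
<html>
    <body>
        <div class="article">
            <h1>Article Title</h1>
            <p>This is the article content</p>
            <a href="https://example.com">Link</a>
        </div>
    </body>
</html>
'''

# Parse the HTML
soup = BeautifulSoup(html, 'lxml')

# Method 1: access by tag name
print(soup.h1.text)  # prints: Article Title

# Method 2: find() returns the first matching element
div = soup.find('div', class_='article')
print(div.text)

# Method 3: find_all() returns every matching element
links = soup.find_all('a')
for link in links:
    print(link.get('href'))  # prints: https://example.com

# Method 4: CSS selectors
titles = soup.select('div.article h1')
print(titles[0].text)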
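On real pages an element is often missing, and both find() and select_one() return None in that case, so calling .text on the result raises AttributeError. The sketch below shows one way to guard against that. The helper text_or_default and the p.summary selector are made up for illustration, and the built-in html.parser is used so the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup

html = """
<div class="article">
    <h1>Article Title</h1>
    <a href="https://example.com">Link</a>
</div>
"""

# 'html.parser' ships with Python, so this sketch does not need lxml.
soup = BeautifulSoup(html, "html.parser")


def text_or_default(parent, selector, default=""):
    """Return the stripped text of the first match, or default if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else default


title = text_or_default(soup, "div.article h1")
# This element does not exist in the sample HTML, so the default is used.
summary = text_or_default(soup, "div.article p.summary", default="(no summary)")
# Only anchors that actually carry an href attribute are selected.
links = [a["href"] for a in soup.select("div.article a[href]")]

print(title, summary, links)
```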

6. Hands-On Examples

6.1 Scraping Article Titles from Jianshu

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

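url = 'https://www.jianshu.com/'
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses

soup = BeautifulSoup(response.text, 'lxml')
# The 'title' class depends on the site's page markup and may change over time
articles = soup.find_all('a', class_='title')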

for i, article in enumerate(articles, 1):
    print(f'{i}. {article.text.strip()}')

6.2 Scraping GitHub Trending

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

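url = 'https://github.com/trending/python'
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses

soup = BeautifulSoup(response.text, 'lxml')
# These tag and class names depend on GitHub's page markup and may change
repos = soup.find_all('article', class_='Box-row')

for repo in repos[:10]:
    # Repository name; split/join collapses the newlines and extra spaces
    title = ' '.join(repo.find('h2').text.split())
    # Star count; fall back to 'n/a' if the element is missing
    stars_tag = repo.find('span', class_='d-inline-block')
    stars = ' '.join(stars_tag.text.split()) if stars_tag else 'n/a'
    print(f'{title} - {stars}')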

Note: Do not send requests too frequently, and follow each site's robots.txt rules. This article is for learning and discussion only.
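Checking robots.txt can be automated with the standard-library urllib.robotparser module. The sketch below parses a small made-up rules file from a string so that it runs offline. The crawler name my-crawler and the rules themselves are assumptions for illustration, not the rules of any real site.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real crawler would fetch the site's own file
# with parser.set_url("https://www.example.com/robots.txt") and parser.read().
robots_txt = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

agent = "my-crawler"  # hypothetical crawler name
allowed_public = parser.can_fetch(agent, "https://www.example.com/articles/1")
allowed_private = parser.can_fetch(agent, "https://www.example.com/private/data")
delay = parser.crawl_delay(agent)

print(allowed_public, allowed_private, delay)
```

Calling can_fetch() before each request, and waiting at least crawl_delay() seconds when a delay is given, keeps a scraper within the limits the site has published.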

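7. Common Problems and Anti-Scraping Measures

7.1 Common Anti-Scraping Mechanisms

User-Agent checks - workaround: send a realistic User-Agent
IP rate limits - workaround: use proxy IPs
CAPTCHAs - workaround: a CAPTCHA-solving service or OCR
Dynamically loaded content - workaround: use Selenium or Playwright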
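For the first two items, a rough sketch of how the settings might be assembled is shown below. It picks a random User-Agent from a small pool and optionally adds a proxy, producing keyword arguments that can be passed to requests.get(). The helper build_request_kwargs, the UA pool, and the proxy address are all assumptions for illustration, and the proxy address is a placeholder rather than a working proxy.

```python
import random

# Example values only: the UA strings are ordinary browser identifiers and the
# proxy address is a placeholder, not a working proxy.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]
PROXY_URL = "http://127.0.0.1:8080"


def build_request_kwargs(proxy_url=None):
    """Build keyword arguments for requests.get(): a random UA and an optional proxy."""
    kwargs = {
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
        "timeout": 10,
    }
    if proxy_url:
        # requests expects one proxy URL per URL scheme.
        kwargs["proxies"] = {"http": proxy_url, "https": proxy_url}
    return kwargs


kwargs = build_request_kwargs(PROXY_URL)
print(kwargs)
# Usage with requests (not executed here): requests.get(url, **kwargs)
```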

7.2 Adding a Delay Between Requests

import random
import time

# Sleep for a random 1 to 3 seconds after each request
delay = random.uniform(1, 3)
time.sleep(delay)
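In a real scraper the delay belongs inside the loop that issues the requests. The sketch below wraps that pattern in a small helper. The name crawl_politely and its parameters are made up for illustration, and the demo passes a stand-in fetch function and a recording sleep so that it runs instantly without touching the network.

```python
import random
import time


def crawl_politely(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # No pause before the first request, a random pause before the rest.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results


# Demo with a stand-in fetch function and a recorded sleep, so it runs instantly.
pauses = []
pages = crawl_politely(
    ["https://www.example.com/a", "https://www.example.com/b", "https://www.example.com/c"],
    fetch=lambda url: f"fetched {url}",
    sleep=pauses.append,
)
print(pages)
print(f"{len(pauses)} pauses between {len(pages)} requests")
```

In actual use, fetch would be something like lambda url: requests.get(url, timeout=10), and the default time.sleep would be left in place.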

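8. Summary and Next Steps

This article covered the basics of web scraping in Python:

HTTP basics
Using the requests library
Parsing HTML with BeautifulSoup
Hands-on examples
Common anti-scraping measures

Coming next: The next article covers scraping dynamic pages with Selenium, which handles pages rendered by JavaScript, along with more advanced topics such as slider CAPTCHAs and building a proxy IP pool.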

About the author

A developer who loves technology, focused on Python, data analysis, and web development
