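Python Notes 4: Scraping Email Addresses from PubMed Articles with Python

This post shows how to use Python to scrape the email addresses that appear on a PubMed article page, which makes it easier to contact the corresponding author later.

Target site: https://pubmed.ncbi.nlm.nih.gov/

The code is as follows: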

import re

import requests
from bs4 import BeautifulSoup

url = "https://pubmed.ncbi.nlm.nih.gov/31972133/"
# Send a browser-like User-Agent header with the request.
headers = {"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36"}
resp = requests.get(url, headers=headers)
# print(resp.status_code)
resp.encoding = "utf-8"
# print(resp.text)

soup = BeautifulSoup(resp.text, "html.parser")
tags = soup("li")  # shorthand for soup.find_all("li")

# Affiliation entries are the <li> elements that carry a data-affiliation-id attribute.
lst = list()
for tag in tags:
    if tag.get("data-affiliation-id"):
        # On this page, contents[1] is the affiliation text.
        lst.append(tag.contents[1])
# print(lst)

maillist = list()
for text in lst:
    if re.search("@", text):
        # Any run of non-whitespace characters around "@" counts as a match.
        emails = re.findall(r"\S+@\S+", text)
        # print(emails)
        # emails is itself a list, so maillist ends up as a list of lists.
        if emails in maillist:
            continue
        else:
            maillist.append(emails)
print(maillist)
print(len(maillist), "email address were retrived, done")
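To try the selection logic without network access or third-party packages, the same idea can be sketched with the standard library's html.parser in place of BeautifulSoup. This is only an illustration, not the script above: the class AffiliationParser, the function extract_emails, and the SAMPLE_HTML snippet (including its addresses) are hypothetical, and the markup is a simplified guess at what an affiliation list looks like.

```python
import re
from html.parser import HTMLParser


class AffiliationParser(HTMLParser):
    """Collect the text of every <li> that has a data-affiliation-id attribute."""

    def __init__(self):
        super().__init__()
        self.affiliations = []  # one cleaned-up string per matching <li>
        self._inside = False    # True while we are inside a matching <li>
        self._chunks = []       # text fragments of the current <li>

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "li" and dict(attrs).get("data-affiliation-id"):
            self._inside = True
            self._chunks = []

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._inside:
            # Join the fragments and collapse runs of whitespace.
            self.affiliations.append(" ".join("".join(self._chunks).split()))
            self._inside = False


# Hypothetical, simplified markup used only for this demonstration.
SAMPLE_HTML = """
<ul>
  <li data-affiliation-id="affiliation-1"><sup>1</sup> Department of Example,
      Example University. Electronic address: jane.doe@example.org.</li>
  <li data-affiliation-id="affiliation-2"><sup>2</sup> Institute without an address.</li>
  <li class="other">contact@example.org is not an affiliation entry</li>
</ul>
"""


def extract_emails(html):
    """Return the raw \\S+@\\S+ matches found in affiliation entries only."""
    parser = AffiliationParser()
    parser.feed(html)
    parser.close()
    found = []
    for text in parser.affiliations:
        found.extend(re.findall(r"\S+@\S+", text))
    return found


print(extract_emails(SAMPLE_HTML))
```

Only the first entry contributes a match: the second has no address, and the third lacks the data-affiliation-id attribute, so it is skipped even though it contains an "@".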

Running the code produces the following output:

[['xxx.xxx@sxxx.com.']]
1 email address were retrived, done
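The output above shows two side effects of matching with \S+@\S+: the result is a list of lists, and the sentence-ending period is captured as part of the address. A possible post-processing step is sketched below. The helper clean_emails and the pattern EMAIL_PATTERN are hypothetical names introduced here, and the pattern is deliberately simple rather than a complete email validator.

```python
import re

# Simplified pattern for illustration: local part, "@", domain, dot, letters-only suffix.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def clean_emails(nested):
    """Flatten a list of lists of raw matches into unique, tidied addresses."""
    seen = set()
    cleaned = []
    for group in nested:
        for raw in group:
            match = EMAIL_PATTERN.search(raw)
            if match is None:
                continue  # nothing that looks like an address in this token
            email = match.group(0)
            if email not in seen:
                seen.add(email)
                cleaned.append(email)
    return cleaned


print(clean_emails([["xxx.xxx@sxxx.com."]]))  # → ['xxx.xxx@sxxx.com']
```

Because the pattern must end in letters, trailing punctuation such as "." or ";" falls outside the match, and the set keeps each address only once while preserving the order of first appearance.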
  • Author: 括囊无誉
  • Article link: python-4-get-email/
  • Copyright notice: All articles on this blog are original works. Please credit the source when reposting.