Python Notes 6: Harvesting Email Addresses from Multiple PubMed Search Pages with Python

This post records how to build the URLs for multiple search pages and then use a for-in loop to collect the email addresses from each article. The code is as follows:
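Before the full script, the page-URL construction can be sketched on its own: each page after the first just appends a `&page=N` query parameter to the search URL. The base URL below is a hypothetical placeholder (the script itself uses "XXX"):

```python
# Hypothetical PubMed search URL -- a stand-in, not taken from the post
base_url = "https://pubmed.ncbi.nlm.nih.gov/?term=example"
n_extra_pages = 1  # how many pages to fetch beyond the first

urllist = [base_url]  # page 1 needs no "&page=" parameter
for page in range(2, 2 + n_extra_pages):
    urllist.append(base_url + "&page=" + str(page))

print(urllist)
```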

import requests
from bs4 import BeautifulSoup
import re

headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36"}

url = "XXX"  # the URL used in this post is a PubMed search page

# Build the "&page=N" suffixes for the extra search pages
page = 1
pagelist = list()
for page_id in range(1):
    page = page + 1
    pagelist.append("&page=" + str(page))
# print(pagelist)

urllist = list()
urllist.append(url)
for eachpage in pagelist:
    urllist.append(url + eachpage)
    # print(url + eachpage)
# print(urllist)
print(len(urllist), "pages were retrieved, ok!")

# Collect each article's permalink from every search page
lstA = list()
for eachurl in urllist:
    print("Url:", eachurl)
    resp = requests.get(eachurl, headers=headers)
    # print(resp.status_code)
    resp.encoding = "utf-8"
    soup = BeautifulSoup(resp.text, "html.parser")
    tags = soup("button")
    for tag in tags:
        if tag.get("data-permalink-url") and tag.get("data-permalink-url") not in lstA:
            lstA.append(tag.get("data-permalink-url"))
# print(lstA)
# print(len(lstA))

# Visit each article page and keep the affiliation entries
lstB = list()
for link in lstA:
    resp = requests.get(link, headers=headers)
    # print(resp.status_code)
    resp.encoding = "utf-8"
    # print(resp.text)
    soup = BeautifulSoup(resp.text, "html.parser")
    tags = soup("li")
    for tag in tags:
        if tag.get("data-affiliation-id"):
            lstB.append(tag.contents[1])
# print(lstB)
# print(len(lstB))

# Pull unique email addresses out of the affiliation text
maillist = list()
for text in lstB:
    if re.search("@", text):
        emails = re.findall(r"\S+@\S+\.[a-zA-Z0-9]+", text)
        for email in emails:
            if email not in maillist:
                maillist.append(email)
# print(maillist)
print(len(maillist), "email addresses were retrieved, done")

Output:

17 email addresses were retrieved, done
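The email-matching step can be exercised in isolation. The affiliation string below is a made-up example, not real scraped data; it shows why the pattern ends in `\.[a-zA-Z0-9]+` rather than just `\S+`: the trailing sentence period is not captured as part of the address.

```python
import re

# Made-up affiliation text of the kind found on PubMed article pages
text = ("Department of Biology, Example University. "
        "Electronic address: jane.doe@example.edu.")

# Non-space runs around "@", ending in a dotted alphanumeric suffix,
# so the sentence's final period is excluded from the match
emails = re.findall(r"\S+@\S+\.[a-zA-Z0-9]+", text)
print(emails)  # -> ['jane.doe@example.edu']
```

Note the pattern is deliberately loose; it can still pick up stray punctuation glued to an address, which is why the script also de-duplicates before counting.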
  • Author: 括囊无誉
  • Link: python-6-get-email/
  • Copyright: All posts on this blog are original works; please credit the source when republishing!