Web Scraping (BeautifulSoup)

codenavy 2021. 10. 28. 17:16

Workflow

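1) Download the source code of the web page (using the requests module).

2) Find a specific tag in the source code (using BeautifulSoup from the bs4 package).

3) Extract the information stored in that tag.

A tag consists of a start tag and an end tag, and the text is generally stored between the two.

(Some tags have no end tag, e.g. the meta tag.)

e.g. <title>역사란 무엇인가</title>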
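To make the start tag / end tag distinction concrete, here is a minimal sketch on a hand-written HTML string. The markup and the `description` meta content are made up for illustration, and it uses Python's built-in 'html.parser' rather than 'lxml' so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only
html = """
<html>
  <head>
    <title>역사란 무엇인가</title>
    <meta name="description" content="A sample book page">
  </head>
  <body></body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# A paired tag keeps its text between the start tag and the end tag
title_text = soup.title.text

# meta has no end tag, so its information is stored in attributes instead
meta_content = soup.find("meta", attrs={"name": "description"}).get("content")

print(title_text)
print(meta_content)
```

This is why the practice examples later read meta tags with `.get('content')` rather than `.text`.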

Chapter 1. Downloading the source code

import requests

url = 'http://www.yes24.com/Product/goods/61385099'  # the URL must be a string
r = requests.get(url)
r.text # holds the source code of the page
print(r.text)
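In practice a download can hang or come back with an error page, so it can help to wrap the request in a small function. The sketch below is one possible way to do it; the helper name `fetch_html` and the 10-second default timeout are assumptions, not part of the original post.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Download a page and return its source code as a string.

    fetch_html is a hypothetical helper name; the timeout default is arbitrary.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text
```

It would be called as `html = fetch_html('http://www.yes24.com/Product/goods/61385099')`, and the returned string can be passed to BeautifulSoup exactly like `r.text`.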

Chapter 2. Finding the tag that holds the information you want (e.g. the title tag)

2.1. Extracting text by tag name

# Using the tag name directly (when the tag you want is unique in the page)
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'lxml') # 'lxml' is the name of the parser
title = soup.title.text
title = title.split('-')[0].strip() # keep only the book title

# Using find() and find_all()
soup.find('p') # returns only the first tag with this name
soup.find_all('p') # returns every tag with this name
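The difference between find() and find_all() is easiest to see on a small input. The sketch below uses a made-up HTML string and the built-in 'html.parser' (an assumption, chosen so the example runs without lxml).

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only
html = "<div><p>first</p><p>second</p><p>third</p></div>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")      # a single Tag, or None when nothing matches
all_p = soup.find_all("p")    # a list-like ResultSet, empty when nothing matches

first_text = first_p.text
all_texts = [p.text for p in all_p]

missing_one = soup.find("table")       # None
missing_all = soup.find_all("table")   # empty ResultSet

print(first_text)
print(all_texts)
```

Because find() returns None on a miss, calling `.text` on its result without a check raises AttributeError, while looping over an empty find_all() result simply does nothing.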

2.2. Extracting text by tag name + attribute

from bs4 import BeautifulSoup

with open('html_test.html', 'r', encoding='utf-8') as f:
    html = f.read()

soup = BeautifulSoup(html, 'lxml') # 'lxml' is the name of the parser
soup.find('p', attrs={'class':'description'}) # returns only the first tag with this name and attribute
soup.find_all('p', attrs={'class':'description'}) # returns every tag with this name and attribute
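Since html_test.html is not shown here, the following sketch reproduces the same idea on an inline string. The markup is made up, and 'html.parser' stands in for 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the contents of html_test.html
html = """
<p class="title">Book list</p>
<p class="description">First description</p>
<p class="description">Second description</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Only the first <p class="description">
first_desc = soup.find("p", attrs={"class": "description"}).text

# Every <p class="description">
all_desc = [p.text for p in soup.find_all("p", attrs={"class": "description"})]

# class_ is a keyword shortcut for the same class filter
same_desc = [p.text for p in soup.find_all("p", class_="description")]

print(first_desc)
print(all_desc)
```

The `attrs` dictionary is the more general form, since it also works for attributes such as `name` whose keyword would clash with find()'s own parameters.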

2.3. Extracting attribute values by tag name + attribute

# Extract the href value of the a tag whose class is BS_Korean
soup.find('a', attrs={'class':'BS_Korean'}).get('href')
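Chaining `.get()` directly onto find() fails with AttributeError when the tag is not in the page. A small guard avoids that; `get_attr` is a hypothetical helper name and the markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only
html = (
    '<a class="BS_Korean" href="https://example.com/ko">Korean</a>'
    '<a class="no_link">No link</a>'
)
soup = BeautifulSoup(html, "html.parser")


def get_attr(soup, name, attrs, key):
    """Return the value of attribute `key` on the first matching tag, or None."""
    tag = soup.find(name, attrs=attrs)
    if tag is None:
        return None          # the tag itself was not found
    return tag.get(key)      # .get() returns None when the attribute is absent


href = get_attr(soup, "a", {"class": "BS_Korean"}, "href")
no_href = get_attr(soup, "a", {"class": "no_link"}, "href")
no_tag = get_attr(soup, "a", {"class": "missing"}, "href")

print(href)
```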

[Practice]

Extract the title from the meta tag of the YES24 page for '갯마을 차차차2'

import requests
from bs4 import BeautifulSoup

url = 'http://www.yes24.com/Product/Goods/104105224'
html = requests.get(url)

soup = BeautifulSoup(html.text, 'lxml')
soup.find('meta', attrs={'name':'title'}).get('content') # result: '갯마을 차차차 2 - YES24'

[Practice]

Extract the title, author, publisher, publication date, sale price, and discount rate for '어떤 죽음이 삶에게 말했다' on YES24

import requests
from bs4 import BeautifulSoup

url = 'http://www.yes24.com/product/goods/96971128'
html = requests.get(url).text

soup = BeautifulSoup(html, 'lxml')

# Extract the title, author, publisher, publication date, sale price, and discount rate.
title = soup.find('meta', attrs={'name':'title'}).get('content').split('-')[0].strip()
author = soup.find('meta', attrs={'name':'author'}).get('content')
publisher = soup.find('span', attrs={'class':'gd_pub'}).text
pubdate = soup.find('span', attrs={'class':'gd_date'}).text
price = soup.find('span', attrs={'class':'nor_price'}).text
discount = soup.find_all('td')[1].text.strip().split('  ')[1].split()[0].replace("(", "")

with open('book_info.txt', 'w', encoding='utf-8') as f:
    f.write(','.join(['title', 'author', 'publisher', 'pubdate', 'price', 'discount'])+'\n')
    f.write(','.join([title, author, publisher, pubdate, price, discount]))
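Joining the values with ',' breaks the columns if any value contains a comma itself, which can happen with a price formatted with thousands separators. One alternative is the standard csv module, which quotes such values. This is a sketch, not the original post's method: `write_book_row` is a hypothetical helper, and the row values are placeholders rather than scraped data.

```python
import csv
import io


def write_book_row(f, header, row):
    """Write a header line and one data line, quoting values that contain commas."""
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerow(row)


header = ["title", "author", "publisher", "pubdate", "price", "discount"]
# Placeholder values, not real scraped data
row = ["Sample Title", "Sample Author", "Sample Publisher", "2021-01-01", "13,500", "10%"]

buffer = io.StringIO()   # in-memory file, so the example has no side effects
write_book_row(buffer, header, row)
output = buffer.getvalue()
print(output)
```

To write to disk instead, open the file with `open('book_info.txt', 'w', encoding='utf-8', newline='')` and pass it as `f`; the csv module recommends `newline=''` so it can control line endings itself.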

Chapter 3. Using BeautifulSoup navigation

- When a tag is hard to reach directly, first access a tag you can reach, then use it as a starting point to move to the tag you want.

- Every tag can be a parent, a child, or a sibling of other tags.

- .contents, .parent, .next_sibling(or .previous_sibling), .find_next_siblings()

from bs4 import BeautifulSoup

with open('html_test.html', 'r', encoding='utf-8') as f:
    html = f.read()

soup = BeautifulSoup(html, 'lxml')
# The first .next_sibling is often the whitespace between tags, so it is called twice to reach the next tag
soup.p.next_sibling.next_sibling
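The navigation attributes listed above can be tried on a small inline document. The markup and ids below are invented, and 'html.parser' is used in place of 'lxml' so the sketch runs without extra packages.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup, for illustration only
html = """<div id="box">
<p id="a">one</p>
<p id="b">two</p>
<p id="c">three</p>
</div>"""
soup = BeautifulSoup(html, "html.parser")

first = soup.p                           # starting point: <p id="a">

parent_id = first.parent.get("id")       # move up to the enclosing <div>
gap = first.next_sibling                 # the newline text between the two <p> tags
second = first.next_sibling.next_sibling # the next actual tag, <p id="b">

# find_next_siblings() skips the text nodes and returns only matching tags
later_texts = [t.text for t in first.find_next_siblings("p")]

# .contents mixes tags and text nodes, so filter for Tag objects
child_ids = [c.get("id") for c in first.parent.contents if isinstance(c, Tag)]

print(parent_id, second.text, later_texts, child_ids)
```

When the goal is simply "the next tag", `find_next_sibling()` or `find_next_siblings()` is usually less fragile than counting `.next_sibling` calls, because the number of whitespace nodes depends on how the HTML is formatted.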
