Unable to fetch article content using BeautifulSoup in Python 3.7

I am doing web scraping with BeautifulSoup in Python 3.7. The code below successfully scrapes the date, title, and tag, but not the article content. For the content it returns None instead.

import time
import requests
from bs4 import BeautifulSoup
from bs4.element import Tag
url = 'https://www.thehindu.com/search/?q=cybersecurity&order=DESC&sort=publishdate&ct=text&page={}'
pages = 32
for page in range(4, pages+1):
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.find_all("a", {"class": "story-card75x1-text"}, href=True):
        _href = item.get("href")
        try:
            resp = requests.get(_href)
        except Exception as e:
            try:
                resp = requests.get("https://www.thehindu.com"+_href)
            except Exception as e:
                continue

        dateTag = soup.find("span", {"class": "dateline"})
        sauce = BeautifulSoup(resp.text,"lxml")
        tag = sauce.find("a", {"class": "section-name"})
        titleTag = sauce.find("h1", {"class": "title"})
        contentTag = sauce.find("div", {"class": "_yeti_done"})

        date = None
        tagName = None
        title = None
        content = None

        if isinstance(dateTag,Tag):
            date = dateTag.get_text().strip()

        if isinstance(tag,Tag):
            tagName = tag.get_text().strip()

        if isinstance(titleTag,Tag):
            title = titleTag.get_text().strip()

        if isinstance(contentTag,Tag):
            content = contentTag.get_text().strip()

        print(f'{date}\n {tagName}\n {title}\n {content}\n')

        time.sleep(3)

I can't see where the problem is, since I used the correct class in contentTag.

Thank you.

web-scraping beautifulsoup python-3.7

Piyush Ghasiya 24.06.2019

comment

Double-check the page source that Beautiful Soup actually reads. It may be that the page source does not contain the tag you are looking for. - Argon 24.06.2019
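One way to run the check suggested in this comment is to ask Beautiful Soup directly whether the class exists in the HTML it was given. The following is only a minimal sketch: the helper name class_in_static_html and the sample markup (including its id value) are hypothetical stand-ins for resp.text, and html.parser is used so the snippet needs no parser beyond bs4 itself. If the class is missing from the raw HTML, one possible explanation is that it is added by JavaScript after the page loads, which requests does not execute.

```python
from bs4 import BeautifulSoup


def class_in_static_html(html, class_name):
    """Return True if any tag in the given HTML carries class_name."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find(class_=class_name) is not None


# Hypothetical sample standing in for resp.text from an article page.
sample_html = """
<div id="content-body-0000">
  <p>First paragraph.</p>
</div>
"""

# Check the class used in the question against the raw HTML.
print(class_in_static_html(sample_html, "_yeti_done"))

# Check an alternative hook (an id prefix) against the same HTML.
soup = BeautifulSoup(sample_html, "html.parser")
print(bool(soup.select("[id^='content-body-']")))
```

In a real run, pass resp.text from the article request instead of sample_html.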

Answers (1)


I suspect the links you want to follow from the search page to the article pages end in .ece. I have used that logic in the script to reach the target pages the data is pulled from. I have also defined the selector for the content slightly differently. It now seems to work correctly. The script below fetches data from page 1 only; feel free to modify it to suit your needs.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://www.thehindu.com/search/?q=cybersecurity&order=DESC&sort=publishdate&ct=text&page=1'
base = "https://www.thehindu.com"

res = requests.get(url)
soup = BeautifulSoup(res.text, "lxml")
# Follow only the result links whose href ends in .ece (the article pages)
for item in soup.select(".story-card-news a[href$='.ece']"):
    # urljoin copes with both relative and absolute hrefs
    resp = requests.get(urljoin(base, item.get("href")))
    sauce = BeautifulSoup(resp.text, "lxml")
    title = item.get_text(strip=True)
    # Join all paragraphs inside the element whose id starts with "content-body-"
    content = ' '.join(p.get_text(strip=True) for p in sauce.select("[id^='content-body-'] p"))
    print(f'{title}\n {content}\n')
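Since the script above covers page 1 only, here is a rough sketch of how it might be split into small functions and extended to a range of pages. It reuses the selectors from this answer; the function names article_links, article_text, and page_urls, and the sample HTML strings, are hypothetical and exist only so the parsing can be tried without network access. html.parser is used here instead of lxml purely to avoid the extra dependency.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "https://www.thehindu.com"
SEARCH = BASE + "/search/?q=cybersecurity&order=DESC&sort=publishdate&ct=text&page={}"


def page_urls(first, last):
    """Build the search URLs for pages first..last (inclusive)."""
    return [SEARCH.format(n) for n in range(first, last + 1)]


def article_links(search_html, base=BASE):
    """Collect absolute, de-duplicated .ece links from a search results page."""
    soup = BeautifulSoup(search_html, "html.parser")
    links = []
    for a in soup.select(".story-card-news a[href$='.ece']"):
        link = urljoin(base, a["href"])
        if link not in links:
            links.append(link)
    return links


def article_text(article_html):
    """Join the paragraphs under the element whose id starts with 'content-body-'."""
    soup = BeautifulSoup(article_html, "html.parser")
    paragraphs = soup.select("[id^='content-body-'] p")
    return " ".join(p.get_text(strip=True) for p in paragraphs)


# Hypothetical HTML fragments standing in for real responses.
sample_search = (
    '<div class="story-card-news">'
    '<a href="/news/a.ece">A</a>'
    '<a href="/news/a.ece">A again</a>'
    '<a href="/topic/x/">X</a>'
    "</div>"
)
sample_article = '<div id="content-body-1"><p>One.</p><p> Two. </p></div><p>Footer</p>'

print(page_urls(4, 5))
print(article_links(sample_search))
print(article_text(sample_article))
```

In a real run, each URL from page_urls would be fetched with requests.get and its res.text passed to article_links, then each article link fetched and passed to article_text, keeping a time.sleep between requests as in the question.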

SIM 24.06.2019
