I have written a spider whose main purpose is to extract a single number from http://www.funda.nl/koop/amsterdam/: the maximum page number in the pager at the bottom (for example, the number 255 in the example below).
I managed to do this using a LinkExtractor based on a regular expression that matches the URLs of these pages. The spider is shown below:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess
from Funda.items import MaxPageItem


class FundaMaxPagesSpider(CrawlSpider):
    name = "Funda_max_pages"
    allowed_domains = ["funda.nl"]
    start_urls = ["http://www.funda.nl/koop/amsterdam/"]

    # Link to a page containing thumbnails of several houses, such as http://www.funda.nl/koop/amsterdam/p10/
    le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0])

    rules = (
        Rule(le_maxpage, callback='get_max_page_number'),
    )

    def get_max_page_number(self, response):
        links = self.le_maxpage.extract_links(response)
        max_page_number = 0  # Initialize the maximum page number
        page_numbers = []
        for link in links:
            # Select only pages with a link depth of 3
            if link.url.count('/') == 6 and link.url.endswith('/'):
                # For example, get the number 10 out of the string 'http://www.funda.nl/koop/amsterdam/p10/'
                page_number = int(link.url.split("/")[-2].strip('p'))
                page_numbers.append(page_number)
                # if page_number > max_page_number:
                #     max_page_number = page_number  # Update the maximum page number if the current value is larger than its previous value
        max_page_number = max(page_numbers)
        print("The maximum page number is %s" % max_page_number)
        yield {'max_page_number': max_page_number}

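For reference, the page-number parsing used in the callback can be tried on its own, without Scrapy. This is only a minimal sketch: the helper names `page_number_from_url` and `max_page_number` and the sample URLs are hypothetical and not part of the spider. Unlike the spider, it also guards against segments that are not numeric and against an empty list.

```python
def page_number_from_url(url):
    """Return the page number from a URL like .../koop/amsterdam/p10/, or None."""
    # Same filter as in the spider: exactly 6 slashes and a trailing slash
    if url.count('/') != 6 or not url.endswith('/'):
        return None
    # 'http://www.funda.nl/koop/amsterdam/p10/'.split('/')[-2] is 'p10'
    digits = url.split('/')[-2].strip('p')
    # Extra guard (not in the spider): skip segments that are not purely digits
    return int(digits) if digits.isdigit() else None


def max_page_number(urls):
    """Return the largest page number found in urls, or None if there is none."""
    numbers = [n for n in map(page_number_from_url, urls) if n is not None]
    return max(numbers) if numbers else None


# Hypothetical sample links, shaped like the ones the LinkExtractor matches
sample_urls = [
    "http://www.funda.nl/koop/amsterdam/p2/",
    "http://www.funda.nl/koop/amsterdam/p10/",
    "http://www.funda.nl/koop/amsterdam/p257/",
    "http://www.funda.nl/koop/amsterdam/p10/extra/",  # too deep, ignored
]
print(max_page_number(sample_urls))
```

Calling this on a single list of links returns a single value, so the parsing step by itself does not produce repeated results.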
If I run this with feed output by entering scrapy crawl Funda_max_pages -o funda_max_pages.json on the command line, the resulting JSON file looks like this:
[
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257}
]
I find it strange that the dict is output 7 times instead of just once. After all, the yield statement is outside the for loop. Can anyone explain this behavior?