Scrapy feed output contains the expected output several times instead of only once

I have written a spider whose sole purpose is to extract a single number from http://www.funda.nl/koop/amsterdam/, namely the maximum page number from the pager at the bottom of the page (e.g., the number 255 in the example below).

[Screenshot of the pager at the bottom of the page, showing the maximum page number]

I managed to do this using a LinkExtractor based on a regular expression that matches the URLs of these pages. The spider is shown below:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FundaMaxPagesSpider(CrawlSpider):
    name = "Funda_max_pages"
    allowed_domains = ["funda.nl"]
    start_urls = ["http://www.funda.nl/koop/amsterdam/"]

    le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0])   # Link to a page containing thumbnails of several houses, such as http://www.funda.nl/koop/amsterdam/p10/

    rules = (
        Rule(le_maxpage, callback='get_max_page_number'),
    )

    def get_max_page_number(self, response):
        links = self.le_maxpage.extract_links(response)
        max_page_number = 0                                                 # Initialize the maximum page number
        page_numbers = []
        for link in links:
            if link.url.count('/') == 6 and link.url.endswith('/'):         # Select only pages with a link depth of 3
                page_number = int(link.url.split("/")[-2].strip('p'))       # For example, get the number 10 out of the string 'http://www.funda.nl/koop/amsterdam/p10/'
                page_numbers.append(page_number)
                # if page_number > max_page_number:
                #     max_page_number = page_number                           # Update the maximum page number if the current value is larger than its previous value
        max_page_number = max(page_numbers)
        print("The maximum page number is %s" % max_page_number)
        yield {'max_page_number': max_page_number}

If I run this with feed output by entering scrapy crawl Funda_max_pages -o funda_max_pages.json on the command line, the resulting JSON file looks like this:

[
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257}
]

I find it strange that the dict is output 7 times rather than just once. After all, the yield statement is outside the for loop. Can anyone explain this behavior?


person Kurt Peek    schedule 14.07.2016    source


Answers (2)


  1. Your spider goes to the first start_url.
  2. It uses the LinkExtractor to extract 7 urls.
  3. It downloads each of those 7 urls and calls get_max_page_number on each of them.
  4. For each url, get_max_page_number yields a dictionary.

Since the callback runs once per downloaded response, the yield statement executes 7 times in total, even though it appears only once in the code (see the sketch after this answer for a variant that yields only once).
person Granitosaurus    schedule 14.07.2016
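
As a minimal sketch of the alternative hinted at above (my own illustration, not from the original answer; the class name is hypothetical): since only the pager links on the first page are needed, a plain scrapy.Spider downloads just the start URL, so parse() runs a single time and the item is yielded once.

import scrapy
from scrapy.linkextractors import LinkExtractor

class FundaMaxPageOnceSpider(scrapy.Spider):
    # Hypothetical variant: a plain Spider (not a CrawlSpider) downloads
    # only the start URL, so parse() runs once and yields a single item.
    name = "Funda_max_pages_once"
    allowed_domains = ["funda.nl"]
    start_urls = ["http://www.funda.nl/koop/amsterdam/"]

    le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0])

    def parse(self, response):
        links = self.le_maxpage.extract_links(response)
        page_numbers = [int(link.url.split("/")[-2].strip('p'))
                        for link in links
                        if link.url.count('/') == 6 and link.url.endswith('/')]
        yield {'max_page_number': max(page_numbers)}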

As a workaround, I have written the output to a text file instead of using the JSON feed output:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess

class FundaMaxPagesSpider(CrawlSpider):
    name = "Funda_max_pages"
    allowed_domains = ["funda.nl"]
    start_urls = ["http://www.funda.nl/koop/amsterdam/"]

    le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0])   # Link to a page containing thumbnails of several houses, such as http://www.funda.nl/koop/amsterdam/p10/

    rules = (
        Rule(le_maxpage, callback='get_max_page_number'),
    )

    def get_max_page_number(self, response):
        links = self.le_maxpage.extract_links(response)
        max_page_number = 0                                                 # Initialize the maximum page number
        for link in links:
            if link.url.count('/') == 6 and link.url.endswith('/'):         # Select only pages with a link depth of 3
                print("The link is %s" % link.url)
                page_number = int(link.url.split("/")[-2].strip('p'))       # For example, get the number 10 out of the string 'http://www.funda.nl/koop/amsterdam/p10/'
                if page_number > max_page_number:
                    max_page_number = page_number                           # Update the maximum page number if the current value is larger than its previous value
        print("The maximum page number is %s" % max_page_number)
        place_name = link.url.split("/")[-3]                                # For example, "amsterdam" in 'http://www.funda.nl/koop/amsterdam/p10/'
        print("The place name is %s" % place_name)
        filename = place_name + "_max_pages.txt"                            # File name prefixed with the place name
        with open(filename, 'w') as f:                                      # Open in text mode, since a str is written
            f.write('max_page_number = %s' % max_page_number)               # Write the maximum page number to a text file
        yield {'max_page_number': max_page_number}

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(FundaMaxPagesSpider)
process.start() # the script will block here until the crawling is finished

I have also adapted the spider so that it runs as a script. The script produces a text file amsterdam_max_pages.txt with the single line max_page_number = 257.

person Kurt Peek    schedule 14.07.2016
comment
You are still crawling the 7 urls, but now you are overwriting the same file with max_page_number = 257 7 times... - person Granitosaurus; 14.07.2016
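
To address that comment, a sketch (my own, under the same assumptions about the link structure; the class name is hypothetical) that avoids overwriting the file 7 times: keep the running maximum as spider state and write the file a single time in the spider's closed() hook, which Scrapy calls once when the crawl finishes.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FundaMaxPagesWriteOnceSpider(CrawlSpider):
    name = "Funda_max_pages_write_once"
    allowed_domains = ["funda.nl"]
    start_urls = ["http://www.funda.nl/koop/amsterdam/"]

    le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0])
    rules = (
        Rule(le_maxpage, callback='get_max_page_number'),
    )

    max_page_number = 0   # Running maximum, updated by all 7 callbacks

    def get_max_page_number(self, response):
        for link in self.le_maxpage.extract_links(response):
            if link.url.count('/') == 6 and link.url.endswith('/'):
                page_number = int(link.url.split("/")[-2].strip('p'))
                self.max_page_number = max(self.max_page_number, page_number)

    def closed(self, reason):
        # Runs exactly once, after all responses have been processed
        with open('amsterdam_max_pages.txt', 'w') as f:
            f.write('max_page_number = %s' % self.max_page_number)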