Scrapy - only the last result

I have this difficult program nearly finished, except for one last issue. I am trying to:

1. iterate over a list of entries on a single page
2. extract one piece of data from this first list page for each entry (['RStation'])
3. follow each entry's URL via its href
4. extract more data by iterating over a list on the following page
5. build a single item with data from both the main page and the following page

The problem is that when I open my CSV, I only see duplicates of the last entry from the second iterated list (once for each entry of the first list).

Am I appending the items incorrectly, or misusing response.meta in some way? I tried to follow the documentation for response.meta and I do not understand why this is not working.

Any help would be greatly appreciated.

import scrapy
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
from fspeople.items import FspeopleItem

class FSSpider(scrapy.Spider):
    name = "fspeople"
    allowed_domains = ["fs.fed.us"]
    start_urls = [
        "http://www.fs.fed.us/research/people/people_search_results.php?3employeename=&keywords=&station_id=SRS&state_id=ALL",
        #"http://www.fs.fed.us/research/people/people_search_results.php?employeename=&keywords=&station_id=RMRS&state_id=ALL",
        #"http://www.fs.fed.us/research/people/people_search_results.php?employeename=&keywords=&station_id=PSW&state_id=ALL",
        #"http://www.fs.fed.us/research/people/people_search_results.php?employeename=&keywords=&station_id=PNW&state_id=ALL",
        #"http://www.fs.fed.us/research/people/people_search_results.php?employeename=&keywords=&station_id=NRS&state_id=ALL",
        #"http://www.fs.fed.us/research/people/people_search_results.php?employeename=&keywords=&station_id=IITF&state_id=ALL",
        #"http://www.fs.fed.us/research/people/people_search_results.php?employeename=&keywords=&station_id=FPL&state_id=ALL",
        #"http://www.fs.fed.us/research/people/people_search_results.php?employeename=&keywords=&station_id=WO&state_id=ALL"
    ]
    def __init__(self):
        self.i = 0

    def parse(self,response):
        for sel in response.xpath("//a[@title='Click to view their profile ...']/@href"):
            item = FspeopleItem()
            url = response.urljoin(sel.extract())
            item['RStation'] = response.xpath("//table[@id='table_id']/tbody/tr/td[2]/i/b/text() | //table[@id='table_id']/tbody/td[2]/text()").extract_first().strip()
            request = scrapy.Request(url, callback=self.parse_post)
            request.meta['item'] = item
            yield request
        self.i += 1

    def parse_post(self, response):
        theitems = []
        pubs = response.xpath("//div/h2[text()='Featured Publications & Products']/following-sibling::ul[1]/li | //div/h2[text()='Publications']/following-sibling::ul[1]/li")
        for i in pubs:
            item = response.meta['item']
            name = response.xpath("//div[@id='maincol']/h1/text() | //nobr/text()").extract_first().strip()
            pubname = i.xpath("a/text()").extract_first().strip()
            pubauth = i.xpath("text()").extract_first().strip()
            pubURL = i.xpath("a/@href").extract_first().strip()
            #RStation = response.xpath("//div[@id='right-float']/div/div/ul/li/a/text()").extract_first().strip()

            item['link'] = response.url
            item['name'] = name
            item['pubname'] = pubname
            item['pubauth'] = pubauth
            item['pubURL'] = pubURL
            #item['RStation'] = RStation

            theitems.append(item)
        return theitems

python scrapy

Chris 23.04.2016

comment

You are overriding __init__ but you are not calling super for scrapy.Spider - Steven Almeroth 25.04.2016
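
A minimal sketch of the pattern this comment points at, for readers who want to see it spelled out. `BaseSpider` here is a hypothetical stand-in for `scrapy.Spider` so the snippet runs without Scrapy installed; it only mimics a base initializer that accepts a name and keyword arguments, which is an assumption made for illustration.

```python
class BaseSpider:
    """Hypothetical stand-in for scrapy.Spider (assumption for illustration)."""

    def __init__(self, name=None, **kwargs):
        self.name = name
        self.extra_args = kwargs


class FSSpider(BaseSpider):
    def __init__(self, *args, **kwargs):
        # Let the base class run its own setup first...
        super().__init__(*args, **kwargs)
        # ...then add the spider's own state.
        self.i = 0


spider = FSSpider(name="fspeople")
print(spider.name, spider.i)
```

Forwarding `*args` and `**kwargs` keeps the subclass compatible with whatever arguments the base class expects.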

comment

You are reusing the same item on every pass through your loop. Try `item = response.meta.get('item')` - Steven Almeroth 25.04.2016
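
To illustrate the shared-item problem this comment describes: looking the item up in `meta` returns the same object on every pass, so each append stores another reference to it rather than a new record. A minimal sketch, assuming a plain dict as a stand-in for `FspeopleItem` and made-up publication titles:

```python
# A plain dict stands in for the Scrapy item (assumption for illustration).
shared_item = {"RStation": "SRS"}
pub_titles = ["Paper A", "Paper B", "Paper C"]  # hypothetical data

results = []
for title in pub_titles:
    item = shared_item        # same object every time, no copy is made
    item["pubname"] = title   # overwrites the previous title
    results.append(item)      # appends another reference to the same dict

# Every list entry is the same object, so all of them show the last title.
print([r["pubname"] for r in results])          # ['Paper C', 'Paper C', 'Paper C']
print(all(r is shared_item for r in results))   # True
```

Fetching the object with `meta.get('item')` instead of `meta['item']` would not change this on its own, since both return the same object; a copy or a fresh item is still needed for each publication.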

Answers (1)


Create a new item instance for each iteration.

def parse_post(self, response):
    [...]
    for i in pubs:
        item = response.meta['item']
        item = item.copy()
        [...]
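
The effect of the copy can be sketched without Scrapy. In this hedged example, plain dicts stand in for the item and for the parsed publication entries, and `build_items`, `title` and `url` are hypothetical names chosen for illustration. Each pass copies the base record, so the entries no longer overwrite one another.

```python
def build_items(base_item, pubs):
    """Yield one independent record per publication."""
    for pub in pubs:
        item = base_item.copy()          # shallow copy: a new dict per pass
        item["pubname"] = pub["title"]
        item["pubURL"] = pub["url"]
        yield item


base = {"RStation": "SRS"}               # data carried over from the first page
pubs = [                                 # hypothetical second-page entries
    {"title": "Paper A", "url": "/a"},
    {"title": "Paper B", "url": "/b"},
]

items = list(build_items(base, pubs))
print([i["pubname"] for i in items])     # ['Paper A', 'Paper B']
print(base)                              # {'RStation': 'SRS'}
```

A shallow copy is enough here because every field is a plain string; if an item held nested lists or dicts that are modified per entry, a deep copy would be the safer choice.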

Frederic Bazin 25.04.2016
