How do I reference a parent and remove the parent element in RSS XML through lxml in Python?

I'm having trouble cracking this one. I have an RSS feed in the form of an XML file. Simplified, it looks like this:

<rss version="2.0">
    <channel>
        <title>My RSS Feed</title>
        <link href="https://www.examplefeedurl.com">Feed</link>
        <description></description>
        <item>...</item>
        <item>...</item>
        <item>...</item>
        <item>
            <guid></guid>
            <pubDate></pubDate>
            <author/>
            <title>Title of the item</title>
            <link href="https://example.com" rel="alternate" type="text/html"/>
            <description>
            <![CDATA[<a href="https://example.com" target="_blank" rel="noopener noreferrer">View Example</a>]]>
            </description>
            <description>
            <![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]>
            </description>
        </item>
        <item>...</item>
    </channel>
</rss>

My goal is to check whether the second description tag contains certain strings. If it does contain such a string, I want to remove the whole item. Currently my code looks like this:

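import lxml.etree

doc = lxml.etree.fromstring(testString)
found = doc.findall('channel/item/description')

for desc in found:
    # desc.text is None for an empty <description/>, so guard before testing
    if desc.text and "FORBIDDENSTRING" in desc.text:
        desc.getparent().remove(desc)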

This removes the second description tag, which makes sense, but I want the whole item gone. I don't know how to get hold of the 'item' element when I only have a reference to the 'description' element.

I've tried Googling and searching here, but the cases I find only want to remove the tag itself, as I'm doing now. Oddly, I haven't stumbled on sample code that removes the whole parent object. Any pointers to documentation/tutorials or help would be very welcome.

Bono Vanderpoorten 17.05.2018

Answers (2)

I'm a big fan of XSLT, but another option is to simply select the item instead of the description (select the element you want to remove, not its child).

Also, if you use xpath(), you can put the forbidden-string check directly in the XPath predicate.

Example...

from lxml import etree

testString = """
<rss version="2.0">
    <channel>
        <title>My RSS Feed</title>
        <link href="https://www.examplefeedurl.com">Feed</link>
        <description></description>
        <item>...</item>
        <item>...</item>
        <item>...</item>
        <item>
            <guid></guid>
            <pubDate></pubDate>
            <author/>
            <title>Title of the item</title>
            <link href="https://example.com" rel="alternate" type="text/html"/>
            <description>
            <![CDATA[<a href="https://example.com" target="_blank" rel="noopener noreferrer">View Example</a>]]>
            </description>
            <description>
            <![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]>
            </description>
        </item>
        <item>...</item>
    </channel>
</rss>
"""

forbidden_string = "I want to get rid of the whole item"

parser = etree.XMLParser(strip_cdata=False)
doc = etree.fromstring(testString, parser=parser)
found = doc.xpath('.//channel/item[description[contains(.,"{}")]]'.format(forbidden_string))

for item in found:
    item.getparent().remove(item)

print(etree.tostring(doc, encoding="unicode", pretty_print=True))

This prints...

<rss version="2.0">
    <channel>
        <title>My RSS Feed</title>
        <link href="https://www.examplefeedurl.com">Feed</link>
        <description/>
        <item>...</item>
        <item>...</item>
        <item>...</item>
        <item>...</item>
    </channel>
</rss>
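
A possible variation on the example above, offered as a sketch rather than as part of the original answer: lxml's xpath() accepts XPath variables as keyword arguments, so the search string does not have to be formatted into the expression. This matters when the string may contain quote characters. The sample feed and the variable name needle are made up for illustration.

```python
from lxml import etree

# Hypothetical, shortened feed; not the asker's real data.
SAMPLE = """<rss version="2.0">
  <channel>
    <item><description>It's "quoted" text to drop</description></item>
    <item><description>Plain text to keep</description></item>
  </channel>
</rss>"""

# A search string with both quote styles would break a formatted XPath literal.
forbidden_string = 'It\'s "quoted"'

doc = etree.fromstring(SAMPLE)

# $needle is bound from the keyword argument of the same name.
found = doc.xpath(
    "channel/item[description[contains(., $needle)]]",
    needle=forbidden_string,
)

for item in found:
    item.getparent().remove(item)

remaining = [d.text for d in doc.findall("channel/item/description")]
print(remaining)
```

The removal loop is unchanged from the answer; only the way the string reaches the predicate differs.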

Daniel Haley 17.05.2018

comment

I need to apply more logic to the text inside the description tags than just the forbidden-string check, but the tip to use the item element put me on the right track. I take the item element, use a ChildElementIterator to apply my logic, and can then call back to the item to remove it, as in your example! Thanks a lot! - Bono Vanderpoorten; 22.05.2018
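
The approach described in this comment might look roughly like the sketch below. It is not the commenter's actual code: the sample feed, should_drop and remove_matching_items are hypothetical names, and should_drop stands in for whatever extra logic the description text needs.

```python
from lxml import etree

# Hypothetical sample feed, shortened from the one in the question.
SAMPLE = """<rss version="2.0">
  <channel>
    <title>My RSS Feed</title>
    <item>
      <title>Keep me</title>
      <description><![CDATA[<p>Harmless text.</p>]]></description>
    </item>
    <item>
      <title>Drop me</title>
      <description><![CDATA[<a href="https://example.com">View Example</a>]]></description>
      <description><![CDATA[<p>Contains FORBIDDENSTRING somewhere.</p>]]></description>
    </item>
  </channel>
</rss>"""


def should_drop(text):
    """Hypothetical placeholder for arbitrary logic on the description text."""
    return "FORBIDDENSTRING" in text


def remove_matching_items(root, predicate):
    """Remove every <item> that has a <description> whose text satisfies predicate."""
    removed = 0
    # findall() returns a list, so removing items inside the loop is safe.
    for item in root.findall("channel/item"):
        descriptions = item.findall("description")
        # d.text is None for an empty element, hence the `or ""`.
        if any(predicate(d.text or "") for d in descriptions):
            item.getparent().remove(item)
            removed += 1
    return removed


doc = etree.fromstring(SAMPLE)
removed_count = remove_matching_items(doc, should_drop)
remaining_titles = [t.text for t in doc.findall("channel/item/title")]
print(removed_count, remaining_titles)
```

If you only hold a reference to a description element, desc.getparent() returns its item, and item.getparent().remove(item) then removes the whole item.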

Consider XSLT, the special-purpose language designed to transform XML files, for example by conditionally removing nodes by value. Python's lxml can run XSLT 1.0 scripts and can even pass parameters from the Python script to XSLT (not unlike passing parameters in SQL!). This way you avoid any for loops, if logic, or rebuilding the tree at the application layer.

XSLT (save as an .xsl file, which is a special .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" cdata-section-elements="description"/>
  <xsl:strip-space elements="*"/>

  <!-- VALUE TO BE PASSED INTO FROM PYTHON -->
  <xsl:param name="search_string" />       

  <!-- IDENTITY TRANSFORM -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- KEEP ONLY item NODES THAT DO NOT CONTAIN $search_string -->
  <xsl:template match="channel">
    <xsl:copy>
      <xsl:apply-templates select="item[not(contains(description[2], $search_string))]"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

Python (for demonstration, the code below runs two searches using the posted sample)

import lxml.etree as et

# LOAD XML AND XSL
doc = et.parse('Input.xml')
xsl = et.parse('XSLT_String.xsl')

# CONFIGURE TRANSFORMER
transform = et.XSLT(xsl)    

# RUN TRANSFORMATION WITH PARAM
n = et.XSLT.strparam('FORBIDDENSTRING')
result = transform(doc, search_string=n)

print(result)
# <?xml version="1.0"?>
# <rss version="2.0">
#   <channel>
#     <item>...</item>
#     <item>...</item>
#     <item>...</item>
#     <item>
#       <guid/>
#       <pubDate/>
#       <author/>
#       <title>Title of the item</title>
#       <link href="https://example.com" rel="alternate" type="text/html"/>
#       <description><![CDATA[<a href="https://example.com" target="_blank" rel="noopener noreferrer">View Example</a>]]></description>
#       <description><![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]></description>
#     </item>
#     <item>...</item>
#   </channel>
# </rss>

# RUN TRANSFORMATION WITH PARAM
n = et.XSLT.strparam('bunch of text')
result = transform(doc, search_string=n)

print(result)    
# <?xml version="1.0"?>
# <rss version="2.0">
#   <channel>
#     <item>...</item>
#     <item>...</item>
#     <item>...</item>
#     <item>...</item>
#   </channel>
# </rss>

# SAVE TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)

Parfait 17.05.2018

comment

Since you only apply templates to item in the context of channel, you lose the other children of channel (such as title, description, and link). What I would do is remove the template matching channel and add a template matching item. Since you can't reference parameters/variables in a match pattern in XSLT 1.0, I'd add an xsl:if whose test is not(description[contains(.,$search_string)]) (without testing the position of description) and, if true, output the item (xsl:copy with xsl:apply-templates, to keep it push style). - Daniel Haley; 17.05.2018
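
One way to read the suggestion in this comment, as a sketch and not the commenter's own stylesheet: keep the identity transform, drop the channel template, and add an item template guarded by xsl:if. The stylesheet is embedded in a Python string here only to keep the example self-contained. The sample feed is made up.

```python
from lxml import etree

XSLT_SOURCE = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:param name="search_string"/>

  <!-- Identity transform: copies everything not matched elsewhere. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Copy an item only if none of its descriptions contains the search string. -->
  <xsl:template match="item">
    <xsl:if test="not(description[contains(., $search_string)])">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>"""

# Hypothetical, shortened feed; not the asker's real data.
SAMPLE = """<rss version="2.0">
  <channel>
    <title>My RSS Feed</title>
    <item><title>Keep me</title><description>Harmless text</description></item>
    <item><title>Drop me</title><description>Link</description><description>a bunch of text</description></item>
  </channel>
</rss>"""

transform = etree.XSLT(etree.fromstring(XSLT_SOURCE))
result = transform(
    etree.fromstring(SAMPLE),
    search_string=etree.XSLT.strparam("bunch of text"),
)

channel = result.getroot().find("channel")
channel_children = [child.tag for child in channel]
item_titles = [t.text for t in channel.findall("item/title")]
print(channel_children, item_titles)
```

With this version the channel's own title survives, because only item elements get special handling and everything else falls through to the identity template.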
