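Common BeautifulSoup functions

All functions and attributes

According to an answer to the Stack Overflow question "Python & BeautifulSoup - can I use the function findAll repeatedly?", a soup node exposes the following attributes and functions: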

f.HTML_FORMATTERS           f.has_attr
f.XML_FORMATTERS            f.has_key
f.append                    f.hidden
f.attribselect_re           f.index
f.attrs                     f.insert
f.can_be_empty_element      f.insert_after
f.childGenerator            f.insert_before
f.children                  f.isSelfClosing
f.clear                     f.is_empty_element
f.contents                  f.name
f.decode                    f.namespace
f.decode_contents           f.next
f.decompose                 f.nextGenerator
f.descendants               f.nextSibling
f.encode                    f.nextSiblingGenerator
f.encode_contents           f.next_element
f.extract                   f.next_elements
f.fetchNextSiblings         f.next_sibling
f.fetchParents              f.next_siblings
f.fetchPrevious             f.parent
f.fetchPreviousSiblings     f.parentGenerator
f.find                      f.parents
f.findAll                   f.parserClass
f.findAllNext               f.parser_class
f.findAllPrevious           f.prefix
f.findChild                 f.prettify
f.findChildren              f.previous
f.findNext                  f.previousGenerator
f.findNextSibling           f.previousSibling
f.findNextSiblings          f.previousSiblingGenerator
f.findParent                f.previous_element
f.findParents               f.previous_elements
f.findPrevious              f.previous_sibling
f.findPreviousSibling       f.previous_siblings
f.findPreviousSiblings      f.recursiveChildGenerator
f.find_all                  f.renderContents
f.find_all_next             f.replaceWith
f.find_all_previous         f.replaceWithChildren
f.find_next                 f.replace_with
f.find_next_sibling         f.replace_with_children
f.find_next_siblings        f.select
f.find_parent               f.select_one
f.find_parents              f.setup
f.find_previous             f.string
f.find_previous_sibling     f.strings
f.find_previous_siblings    f.stripped_strings
f.format_string             f.tag_name_re
f.get                       f.text
f.getText                   f.unwrap
f.get_text                  f.wrap

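This list is only meant as an overview of the available attributes and functions. The camelCase names (findAll, nextSibling, replaceWith and so on) are legacy aliases of the snake_case names (find_all, next_sibling, replace_with).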
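A listing like the one above can be reproduced with Python's built-in dir(). The following is a minimal sketch, assuming bs4 is installed and using the standard-library html.parser backend. The sample HTML and the variable names are made up for illustration, and the exact set of names depends on the installed bs4 version.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, only used to obtain a Tag object.
soup = BeautifulSoup("<p>hello</p>", "html.parser")
tag = soup.find("p")

# Public attribute and method names of the Tag, skipping private ones.
public_names = [name for name in dir(tag) if not name.startswith("_")]

print(len(public_names))
print(public_names[:10])
```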

Common attributes

Common usage

Tag name

curTagStr = eachSoupNode.name

Given the element:

<XCUIElementTypeStaticText type="XCUIElementTypeStaticText" value="兴业 信用卡" name="兴业 信用卡" label="兴业 信用卡" enabled="true" visible="true" x="99" y="593" width="81" height="18"/>

this yields the tag name: XCUIElementTypeStaticText
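As a small self-contained illustration of .name, the sketch below collects the tag names of every node in a document. The HTML fragment and variable names are hypothetical; html.parser is used only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, not taken from the original page.
html = '<div id="card"><span>hello</span></div>'
soup = BeautifulSoup(html, "html.parser")

# find_all(True) matches every tag, in document order.
tag_names = [node.name for node in soup.find_all(True)]
print(tag_names)
```

Note that html.parser lowercases tag names. To keep mixed-case names such as XCUIElementTypeStaticText, the document would have to be parsed with an XML-aware parser, for example BeautifulSoup(xml_text, "xml"), which requires lxml.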

A node's attrs attribute is a dict

curAttrib = eachSoupNode.attrs

The result is a plain dict. For:

<XCUIElementTypeButton type="XCUIElementTypeButton" enabled="true" visible="true" x="0" y="0" width="414" height="691">

the value is:

{'enabled': 'true', 'height': '691', 'type': 'XCUIElementTypeButton', 'visible': 'true', 'width': '414', 'x': '0', 'y': '0'}
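All values in the attrs dict are strings, so numeric or boolean attributes need an explicit conversion. A minimal sketch, assuming an HTML button element that mirrors the attributes above (the element and variable names are illustrative only):

```python
from bs4 import BeautifulSoup

# Hypothetical element carrying the same kind of attributes as the sample above.
html = '<button type="button" enabled="true" x="0" y="0" width="414" height="691">OK</button>'
soup = BeautifulSoup(html, "html.parser")
button = soup.find("button")

attrs = button.attrs                      # plain dict of str -> str
width = int(attrs["width"])               # convert numeric strings explicitly
height = int(attrs["height"])
is_enabled = attrs["enabled"] == "true"   # the string "true", not a bool

print(width * height, is_enabled)
```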

Another example.

HTML:

<h4>
    <a href="../sanguozhanji/" target="_blank" title="三国战纪"><em
            class='keyword'>三国</em>战纪(官方正版)</a>
    <span>
        20年经典风靡街机厅
    </span>
</h4>

Reading the attributes:

h4Soup = dtSoup.find("h4")
h4aSoup = h4Soup.find("a")
h4aAttrDict = h4aSoup.attrs # h4aAttrDict={'href': '../sanguozhanji/', 'target': '_blank', 'title': '三国战纪'}
aHref = h4aAttrDict["href"] # '../sanguozhanji/'
aTitle = h4aAttrDict["title"] # '三国战纪'
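Attributes can also be read directly on the Tag, without going through attrs first. The sketch below uses a shortened, single-line version of the HTML above; the variable names are illustrative. It also shows that .get() avoids a KeyError for a missing attribute and that class is returned as a list when parsing HTML.

```python
from bs4 import BeautifulSoup

# Shortened version of the sample HTML above.
html = '<h4><a href="../sanguozhanji/" target="_blank" title="三国战纪"><em class="keyword">三国</em>战纪</a></h4>'
soup = BeautifulSoup(html, "html.parser")
a_node = soup.find("h4").find("a")

href = a_node["href"]                # same as a_node.attrs["href"]
missing = a_node.get("id")           # None instead of a KeyError
fallback = a_node.get("id", "none")  # explicit default value
em_class = a_node.find("em")["class"]  # class is multi-valued, so a list

print(href, missing, fallback, em_class)
```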

Deleting an attribute

Official documentation: attributes

This works just like deleting a key from a dict:

del curNode.attrs["keyToDelete"]

Or:

curNodeAttributeDict = curNode.attrs
del curNodeAttributeDict["keyToDelete"]
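A runnable sketch of attribute deletion, assuming a hypothetical link element. Because attrs is an ordinary dict, pop() with a default can be used when the attribute may be absent.

```python
from bs4 import BeautifulSoup

# Hypothetical element used only for this illustration.
soup = BeautifulSoup('<a href="../sanguozhanji/" target="_blank">link</a>', "html.parser")
a_node = soup.find("a")

# del also works directly on the Tag.
del a_node["target"]

# pop() with a default does not raise when the key is absent.
removed = a_node.attrs.pop("title", None)

print(a_node)
```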

Common function operations

soup.find

More practical usage examples:

    # <h1 class="h1user">crifan</h1>

    # method 1: positional arguments
    h1userSoup = soup.find("h1", {"class":"h1user"})

    # method 2: keyword arguments
    h1userSoup = soup.find(name="h1", attrs={"class":"h1user"})

    h1userUnicodeStr = h1userSoup.string
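find returns None when nothing matches, so the result is worth checking before .string is read. A minimal sketch, assuming a made-up document and using the class_ keyword, which bs4 provides because class is a reserved word in Python:

```python
from bs4 import BeautifulSoup

# Hypothetical document containing the h1 from the comment above.
soup = BeautifulSoup('<div><h1 class="h1user">crifan</h1></div>', "html.parser")

# class_ is the keyword form of attrs={"class": "h1user"}.
h1_node = soup.find("h1", class_="h1user")
user_name = h1_node.string if h1_node is not None else None

# No h2 exists, so find returns None rather than raising.
not_found = soup.find("h2", class_="h1user")

print(user_name, not_found)
```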

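Modifying the content:

Note: a Tag is changed by assigning to its properties (its .string, or an attribute value such as ['class']); you cannot assign a new value to the Tag itself.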

soup.body.div.h1.string = changedToString

soup.body.div.h1['class'] = "newH1User"
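The two assignments can be tried end to end as follows. This is a sketch with a hypothetical document; "new name" stands in for changedToString.

```python
from bs4 import BeautifulSoup

# Hypothetical document with the body > div > h1 nesting used above.
soup = BeautifulSoup(
    '<body><div><h1 class="h1user">crifan</h1></div></body>', "html.parser"
)
h1 = soup.body.div.h1

h1.string = "new name"      # replaces the text inside the tag
h1["class"] = "newH1User"   # replaces the value of the class attribute

result = str(h1)
print(result)
```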

soup.find_all

By default, find_all returns every matching element.

To limit the number of results, pass limit:

soup.find_all('title', limit=2)

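Special case:

find is equivalent to find_all with limit=1

That is, the following two calls locate the same element. find_all returns it inside a list, while find returns the element itself (or None when nothing matches):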

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>
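The difference in return types is easy to check on a small document. A sketch with made-up HTML and variable names:

```python
from bs4 import BeautifulSoup

# Hypothetical list with three items.
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first_two = soup.find_all("li", limit=2)   # list-like, at most 2 items
first = soup.find("li")                    # a single Tag

no_match_list = soup.find_all("table")     # empty result, not None
no_match_one = soup.find("table")          # None

print([li.string for li in first_two], first.string)
```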

decompose: deleting a node

nodeToDelete.decompose()

Official documentation: decompose()
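A short sketch of decompose() in use, assuming a simplified version of the h4 example from earlier on this page. The span is removed from the tree and destroyed, so it no longer appears in the output.

```python
from bs4 import BeautifulSoup

# Simplified version of the earlier h4 sample.
html = "<h4><a href='../sanguozhanji/'>link</a><span>20年经典风靡街机厅</span></h4>"
soup = BeautifulSoup(html, "html.parser")

span = soup.find("span")
if span is not None:        # find returns None when there is no match
    span.decompose()        # remove the node and its contents from the tree

result = str(soup)
print(result)
```

If the removed node is still needed afterwards, extract() from the list at the top of this page detaches the node and returns it instead of destroying it.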
