Extending CSS selectors in BeautifulSoup

Question: Extending CSS selectors in BeautifulSoup

BeautifulSoup has very limited support for CSS selectors. For instance, the only supported pseudo-class is nth-of-type, and it accepts only numeric values; arguments such as even or odd are not allowed.

Is it possible to extend BeautifulSoup's CSS selectors, or to have it use lxml.cssselect internally as the underlying CSS selection mechanism?

Let's take a look at an example problem/use case. Locate only the even rows in the following HTML:

<table>
    <tr>
        <td>1</td>
    </tr>
    <tr>
        <td>2</td>
    </tr>
    <tr>
        <td>3</td>
    </tr>
    <tr>
        <td>4</td>
    </tr>
</table>

With lxml.html and lxml.cssselect, this is easy to do via :nth-of-type(even):

from lxml.html import fromstring
from lxml.cssselect import CSSSelector

tree = fromstring(data)  # data holds the HTML shown above

sel = CSSSelector('tr:nth-of-type(even)')

print([e.text_content().strip() for e in sel(tree)])

But in BeautifulSoup:

print(soup.select("tr:nth-of-type(even)"))

throws an error:

NotImplementedError: Only numeric values are currently supported for the nth-of-type pseudo-class.

Note that we can work around it with .find_all():

print([row.get_text(strip=True) for index, row in enumerate(soup.find_all("tr"), start=1) if index % 2 == 0])
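The same index-based workaround can be wrapped in a small helper. The sketch below is only an illustration: `filter_nth` is a hypothetical name, not part of BeautifulSoup, and it works on any sequence, such as the result of soup.find_all("tr"). It counts positions across the whole sequence, so it matches nth-of-type only when all items share the same parent, as the rows of the single table above do.

```python
def filter_nth(items, spec):
    """Return the items whose 1-based position matches `spec`.

    `spec` may be "even", "odd", or a positive integer.
    Hypothetical helper; not part of BeautifulSoup.
    """
    if spec == "even":
        return [item for i, item in enumerate(items, start=1) if i % 2 == 0]
    if spec == "odd":
        return [item for i, item in enumerate(items, start=1) if i % 2 == 1]
    position = int(spec)
    if position < 1:
        raise ValueError("position must be at least 1")
    return [item for i, item in enumerate(items, start=1) if i == position]


# Stand-in for [row.get_text(strip=True) for row in soup.find_all("tr")]
rows = ["1", "2", "3", "4"]
print(filter_nth(rows, "even"))  # ['2', '4']
print(filter_nth(rows, "odd"))   # ['1', '3']
print(filter_nth(rows, 3))       # ['3']
```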

Source

2015-12-21 alecxe

After checking the source code, it seems that BeautifulSoup does not provide any convenient point in its interface for extending or monkey patching its existing functionality in this regard. Using functionality from lxml is not possible either, since BeautifulSoup only uses lxml during parsing and uses the parsing results to create its own objects from them. The lxml objects are not preserved and cannot be accessed later.

That being said, with enough determination and with Python's flexibility and introspection capabilities, anything is possible. You can modify BeautifulSoup's internal methods even at run time:

import inspect 
import re 
import textwrap 

import bs4.element 


def replace_code_lines(source, start_token, end_token,
                       replacement, escape_tokens=True):
    """Replace the source code between `start_token` and `end_token`
    in `source` with `replacement`. The `start_token` portion is included
    in the replaced code. If `escape_tokens` is True (default),
    escape the tokens to avoid them being treated as a regular expression."""

    if escape_tokens:
        start_token = re.escape(start_token)
        end_token = re.escape(end_token)

    def replace_with_indent(match):
        indent = match.group(1)
        return textwrap.indent(replacement, indent)

    return re.sub(r"^(\s+)({}[\s\S]+?)(?=^\1{})".format(start_token, end_token),
                  replace_with_indent, source, flags=re.MULTILINE)


# Get the source code of the Tag.select() method 
src = textwrap.dedent(inspect.getsource(bs4.element.Tag.select)) 

# Replace the relevant part of the method 
start_token = "if pseudo_type == 'nth-of-type':" 
end_token = "else" 
replacement = """\
if pseudo_type == 'nth-of-type':
    try:
        if pseudo_value in ("even", "odd"):
            pass
        else:
            pseudo_value = int(pseudo_value)
    except:
        raise NotImplementedError(
            'Only numeric values, "even" and "odd" are currently '
            'supported for the nth-of-type pseudo-class.')
    if isinstance(pseudo_value, int) and pseudo_value < 1:
        raise ValueError(
            'nth-of-type pseudo-class value must be at least 1.')
    class Counter(object):
        def __init__(self, destination):
            self.count = 0
            self.destination = destination

        def nth_child_of_type(self, tag):
            self.count += 1
            if pseudo_value == "even":
                return not bool(self.count % 2)
            elif pseudo_value == "odd":
                return bool(self.count % 2)
            elif self.count == self.destination:
                return True
            elif self.count > self.destination:
                # Stop the generator that's sending us
                # these things.
                raise StopIteration()
            return False
    checker = Counter(pseudo_value).nth_child_of_type
"""
new_src = replace_code_lines(src, start_token, end_token, replacement) 

# Compile it and execute it in the target module's namespace 
exec(new_src, bs4.element.__dict__) 
# Monkey patch the target method 
bs4.element.Tag.select = bs4.element.select

This is the part of the code being modified.

Of course, this is anything but elegant and reliable. I don't envision this being seriously used anywhere, ever.
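To see the source-rewriting technique in isolation, here is a minimal sketch that applies the same regular expression to a toy function instead of Tag.select(). The function `describe` and its strings are hypothetical stand-ins. The source is given as a string literal rather than fetched with inspect.getsource(), so that the example runs on its own.

```python
import re
import textwrap

# Source of a toy function standing in for Tag.select().
src = textwrap.dedent('''\
    def describe(value):
        if value == "numeric":
            result = "numbers only"
        else:
            result = "unsupported"
        return result
''')

# New body for the `if` branch, written without indentation.
replacement = textwrap.dedent('''\
    if value in ("numeric", "even", "odd"):
        result = "numbers, even and odd"
''')

# Match from the start token up to (but not including) the next line
# that has the same indentation and begins with the end token.
pattern = r"^(\s+)({}[\s\S]+?)(?=^\1{})".format(
    re.escape('if value == "numeric":'), re.escape("else"))
new_src = re.sub(pattern,
                 lambda m: textwrap.indent(replacement, m.group(1)),
                 src, flags=re.MULTILINE)

# Compile the rewritten source and pull the patched function out of it.
namespace = {}
exec(new_src, namespace)
describe = namespace["describe"]

print(describe("even"))
print(describe("other"))
```

The captured indentation (`m.group(1)`) is what lets the unindented replacement be spliced back in at the right depth, as in the replace_with_indent() helper.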

Source

2015-12-30 05:46:59

Thanks for the strong "I don't envision this being seriously used anywhere, ever!" :) – alecxe

Officially, BeautifulSoup does not support all CSS selectors.

If Python is not the only option, I strongly recommend JSoup (the Java equivalent of this). It supports all CSS selectors.

It is open source (MIT license)
The syntax is simple
It supports all CSS selectors
It can span multiple threads to scale up
Rich API support in Java for storing results in databases, so it is easy to integrate

Another alternative, if you still want to stick with Python, is to make it a Jython implementation.

http://jsoup.org/

https://github.com/jhy/jsoup/

Source

2015-12-24 10:56:40