Python Scrapy doesn't seem to get data from all available URLs

I'm trying to scrape thesession.org to build a table of how many times each tune has been added to members' tunebooks, so that I can find popular tunes to learn. I started from the Scrapy tutorial and am trying to modify it to suit my purposes. The problem is that although thesession.org seems to have about 10,390 tunes, my scraper returns data for only 10 of them (just the ones on http://www.thesession.org/tunes/index.php). How can I get data on all of the tunes (or at least the hundred top-ranked ones)? Any advice would be much appreciated.
Here is what I have so far:
items.py

from scrapy.item import Item, Field

class tuneItem(Item):
    url = Field()
    name1 = Field()
    name2 = Field()
    key = Field()
    count = Field()
    pass

tune_spider.py

from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from tutorial.items import tuneItem
from scrapy.conf import settings

class tunesSpider(CrawlSpider):
    name = "irishtunes"
    allowed_domains = ["thesession.org"]
    start_urls = ["http://www.thesession.org/tunes"]
    rules = [Rule(SgmlLinkExtractor(allow=['/display/\d+'], deny=['/members/','/recordings/','/index/','/display/\d+/.']), 'parse_tune')]

    def parse_tune(self, response):
        x = HtmlXPathSelector(response)
        tune = tuneItem()
        tune['url'] = response.url
        tune['name1'] = x.select("//div[@id='details']//div[@class='box']/h1/text()").extract()
        tune['name2'] = x.select("//div[@id='details']//div[@class='box']/h2/text()").extract()
        tune['key'] = x.select("//div[@id='details']//div[@class='box']/p[1]/text()").extract()
        tune['count'] = x.select("//div[@id='details']//div[@class='box']/p[3]/text()").re('\d+')
        return tune
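
To see which links the single rule above would pick up, the allow/deny patterns can be tried against a few URLs outside of Scrapy. The sketch below only approximates the link extractor's filtering with plain `re.search` calls; the helper name `rule_accepts` and the two URLs marked as hypothetical are assumptions for illustration, not pages confirmed to exist on the site.

```python
import re

# Same patterns as in the spider's Rule, written as raw strings.
ALLOW = [re.compile(r'/display/\d+')]
DENY = [re.compile(p) for p in (r'/members/', r'/recordings/',
                                r'/index/', r'/display/\d+/.')]


def rule_accepts(url):
    """Rough emulation (an assumption, not Scrapy's own code) of the
    allow/deny check: at least one allow pattern must match and no deny
    pattern may match."""
    allowed = any(p.search(url) for p in ALLOW)
    denied = any(p.search(url) for p in DENY)
    return allowed and not denied


candidates = [
    "http://www.thesession.org/tunes/display/11602",           # seen in the log
    "http://www.thesession.org/tunes/index.php",               # listing page from the question
    "http://www.thesession.org/tunes/display/11602/comments",  # hypothetical sub-page
    "http://www.thesession.org/tunes/index/2",                 # hypothetical listing page
]

for url in candidates:
    print(url, rule_accepts(url))
```

Under these assumptions only the plain `/display/<id>` URL passes, so no listing page is ever matched by the rule, which would be consistent with the `request_depth_max` of 1 in the log.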
I run the scraper by opening a console, changing to the directory that contains the tutorial's cfg file, and running scrapy crawl irishtunes --set FEED_URI=scraped_data.csv --set FEED_FORMAT=csv
Here is what I get:
C:\Users\BM\Desktop\scrape\tutorial>scrapy crawl irishtunes --set FEED_URI=scraped_data.csv --set FEED_FORMAT=csv
2011-11-25 22:45:47-0800 [scrapy] INFO: Scrapy 0.14.0.2841 started (bot: tutorial)
2011-11-25 22:45:47-0800 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Enabled item pipelines:
2011-11-25 22:45:48-0800 [irishtunes] INFO: Spider opened
2011-11-25 22:45:48-0800 [irishtunes] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-11-25 22:45:48-0800 [irishtunes] DEBUG: Redirecting (301) to <GET http://www.thesession.org/tunes/> from <GET http://www.thesession.org/tunes>
2011-11-25 22:45:48-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.thesession.org/tunes/> (referer: None)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.thesession.org/tunes/display/11602> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.thesession.org/tunes/display/11602>
{'count': [u'1'],
'key': [u'Key signature: Dmajor'],
'name1': [u"Brendan Begley's"],
'name2': [u'polka'],
'url': 'http://www.thesession.org/tunes/display/11602'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.thesession.org/tunes/display/11593> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.thesession.org/tunes/display/11593>
{'count': [u'3'],
'key': [u'Key signature: Amajor'],
'name1': [u'Carleton County Breakdown'],
'name2': [u'reel'],
'url': 'http://www.thesession.org/tunes/display/11593'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.thesession.org/tunes/display/11597> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.thesession.org/tunes/display/11597>
{'count': [u'3'],
'key': [u'Key signature: Dmajor'],
'name1': [u"Kasper's Rant"],
'name2': [u'hornpipe'],
'url': 'http://www.thesession.org/tunes/display/11597'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.thesession.org/tunes/display/11594> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.thesession.org/tunes/display/11594>
{'count': [u'5'],
'key': [u'Key signature: Gmajor'],
'name1': [u'The Full Of The Bag'],
'name2': [u'hornpipe'],
'url': 'http://www.thesession.org/tunes/display/11594'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.thesession.org/tunes/display/11599> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.thesession.org/tunes/display/11599>
{'count': [u'1'],
'key': [u'Key signature: Adorian'],
'name1': [u'The New Steamboat'],
'name2': [u'reel'],
'url': 'http://www.thesession.org/tunes/display/11599'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.thesession.org/tunes/display/11598> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.thesession.org/tunes/display/11598>
{'count': [u'4'],
'key': [u'Key signature: Gmajor'],
'name1': [u"Galen's Arrival"],
'name2': [u'reel'],
'url': 'http://www.thesession.org/tunes/display/11598'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.thesession.org/tunes/display/11596> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.thesession.org/tunes/display/11596>
{'count': [u'2'],
'key': [u'Key signature: Amixolydian'],
'name1': [u'Culloden Day'],
'name2': [u'strathspey'],
'url': 'http://www.thesession.org/tunes/display/11596'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.thesession.org/tunes/display/11595> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.thesession.org/tunes/display/11595>
{'count': [u'2'],
'key': [u'Key signature: Aminor'],
'name1': [u'Miss Sine Flemington'],
'name2': [u'barndance'],
'url': 'http://www.thesession.org/tunes/display/11595'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.thesession.org/tunes/display/11600> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.thesession.org/tunes/display/11600>
{'count': [u'2'],
'key': [u'Key signature: Dmajor'],
'name1': [u"Joan Martin's"],
'name2': [u'polka'],
'url': 'http://www.thesession.org/tunes/display/11600'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.thesession.org/tunes/display/11601> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.thesession.org/tunes/display/11601>
{'count': [u'2'],
'key': [u'Key signature: Gmajor'],
'name1': [u'My Time Inside 2005'],
'name2': [u'waltz'],
'url': 'http://www.thesession.org/tunes/display/11601'}
2011-11-25 22:45:49-0800 [irishtunes] INFO: Closing spider (finished)
2011-11-25 22:45:49-0800 [irishtunes] INFO: Stored csv feed (10 items) in: scraped_data.csv
2011-11-25 22:45:49-0800 [irishtunes] INFO: Dumping spider stats:
{'downloader/request_bytes': 3655,
'downloader/request_count': 12,
'downloader/request_method_count/GET': 12,
'downloader/response_bytes': 31620,
'downloader/response_count': 12,
'downloader/response_status_count/200': 11,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2011, 11, 26, 6, 45, 49, 500000),
'item_scraped_count': 10,
'request_depth_max': 1,
'scheduler/memory_enqueued': 12,
'start_time': datetime.datetime(2011, 11, 26, 6, 45, 48, 10000)}
2011-11-25 22:45:49-0800 [irishtunes] INFO: Spider closed (finished)
2011-11-25 22:45:49-0800 [scrapy] INFO: Dumping global stats:
{}
EDIT: The answer from @reclosedev got me on my way. For anyone wondering about the results, here's a rundown ...
(1) The vast majority of tunes are in fewer than 10 members' tunebooks
(2) The popularity of all 10,379 tunes that I could scrape from the site (measured by how many tunebooks they are in) follows a power-law distribution
(3) and here are the tunes that are in more than 1000 tunebooks on the site, showing the names of the top-ranked tunes and how many tunebooks they are in
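
For anyone who wants to repeat the checks in (1) and (2) on their own scraped counts, here is a minimal sketch. It uses synthetic data, not the real results, and the function names `loglog_slope` and `share_below` are made up for illustration. Fitting a straight line to log(count) against log(rank) is only a rough indication of power-law behaviour, not a rigorous statistical test.

```python
import math


def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank).

    Counts are ranked from largest to smallest. A roughly straight line
    with negative slope on log-log axes hints at a power law.
    """
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def share_below(counts, threshold):
    """Fraction of tunes whose tunebook count is below the threshold."""
    return sum(1 for c in counts if c < threshold) / len(counts)


# Synthetic counts that follow count = 1000 / rank exactly (not real data).
synthetic = [1000.0 / rank for rank in range(1, 201)]

print("log-log slope:", loglog_slope(synthetic))
print("share below 10:", share_below(synthetic, 10))
```

With real data, the list of counts would come from the `count` column of the exported CSV instead of the synthetic list.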
Interesting results, but you could have saved yourself the trouble: http://www.irishtune.info/session/tunes.php – alanng