Python not progressing through a list of links
So, to get more detailed data I have to dig a bit deeper into the HTML code of the website. I wrote a script that returns me a list of links to the specific detail pages, but I can't get Python to visit each link in that list for me; it stops at the first one. What am I doing wrong?
    import re
    import urllib2
    import requests
    from bs4 import BeautifulSoup
    from lxml import html

    # open the site
    html_page = urllib2.urlopen("http://www.sitetoscrape.ch/somesite.aspx")
    # feed the page to BeautifulSoup
    soup = BeautifulSoup(html_page)
    # search for the specific links
    for link in soup.find_all('a', href=re.compile('/d/part/of/thelink/ineed.aspx')):
        # print the found links
        print link.get('href')
        # build the complete links
        complete_links = 'http://www.sitetoscrape.ch' + link.get('href')
        # print the complete links
        print complete_links

    # everything works fine up to this point

    page = requests.get(complete_links)
    tree = html.fromstring(page.text)
    # details
    name = tree.xpath('//dl[@class="services"]')
    for i in name:
        print i.text_content()
Also: can you recommend a tutorial to learn how to put the output in a file, clean it up, give variables meaningful names, and so on?
I think you want a list of links in complete_links instead of a single link. As @pynchia and @lemonhead said, you're overwriting complete_links on every iteration of the first loop.
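A quick standalone illustration of the difference, using made-up hrefs rather than your real data:

```python
# Made-up hrefs standing in for the ones BeautifulSoup finds.
hrefs = ["/d/a.aspx", "/d/b.aspx", "/d/c.aspx"]

# Overwriting: each iteration replaces the previous value,
# so after the loop only the last link survives.
for href in hrefs:
    complete_links = "http://www.sitetoscrape.ch" + href
print(complete_links)  # only the last URL

# Appending: every value is kept for later use.
link_list = []
for href in hrefs:
    link_list.append("http://www.sitetoscrape.ch" + href)
print(len(link_list))  # 3
```

That is exactly why your second block only ever sees one URL: by the time it runs, complete_links holds just the final value from the loop.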
You need two changes:
First, append the links to a list instead of overwriting a single variable:
    # [...] same code as above up to the first loop
    link_list = []
    for link in soup.find_all('a', href=re.compile('/d/part/of/thelink/ineed.aspx')):
        print link.get('href')
        complete_links = 'http://www.sitetoscrape.ch' + link.get('href')
        print complete_links
        link_list.append(complete_links)  # append the new link to the list
Then scrape each accumulated link in a second loop:
    for link in link_list:
        page = requests.get(link)
        tree = html.fromstring(page.text)
        # details
        name = tree.xpath('//dl[@class="services"]')
        for i in name:
            print i.text_content()
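As for putting the output in a file: a minimal sketch, where the made-up `results` list stands in for the text_content() strings you would collect while scraping (collect them in a list instead of printing them):

```python
import io

# Placeholder for the scraped strings; in your script, append
# i.text_content() to this list inside the scraping loop.
results = [u"Service one", u"Service two"]

# Write one entry per line; io.open with an explicit utf-8 encoding
# handles non-ASCII site content and works in both Python 2 and 3.
with io.open("output.txt", "w", encoding="utf-8") as f:
    for entry in results:
        f.write(entry + u"\n")
```

The `with` block closes the file for you even if an exception occurs mid-write.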
PS: I recommend the Scrapy framework for tasks like that.