
With this asynchronous (and therefore faster) solution you can collect every URL from a site and write them to a text file in the same directory as the program. The text file can then be used for further crawling, or serve as a simple database. In short: a program that crawls a site and collects all URLs using asyncio and aiohttp.

I modified the original code so that it actually works and does what I want. You can increase the number of parallel tasks. If you find some good improvements, please post your changes here.

I used an existing answer to get it into this working state. Tags: distributed-computing, html, javascript, python

#!/usr/bin/env python3 

# python 3.5 async web crawler. 
# https://github.com/mehmetkose/python3.5-async-crawler 

# Licensed under the MIT license: 
# http://www.opensource.org/licenses/mit-license 
# Copyright (c) 2016 Mehmet Kose [email protected] 

# Copyright (c) 2017 Drill Bit, but I am not so fussy about it. 


''' 
A bot that crawls a single site and extracts all links from it. It does not currently follow links to other sites. 
It writes the site's own pages to one text file, external sites to another text file, and leftovers such as local file links to a separate file. 
After the crawl is complete it deduplicates and sorts the data into cleaned files. A backup is then written, and the cleaned content is written back to the original file for the next run. This way the database can grow over time. 

This program only collects web addresses. 

You can either add content extraction to this program, or write a separate program that reads the list this program generates and extracts the relevant content (see the sketch after the program listing). It will really hammer the server, so use the async workers carefully. On a normal laptop, run at most four instances against different sites at the same time. It is CPU-hungry but not particularly RAM-hungry. 

There are many interesting things one could do with multiple computers, a central web-address database and distributed crawling. Any handheld device could help with a scraping collaboration in exchange for access to the total effort: trade crawling for access. If you want more info on the idea, all I need is a place to cook (Fear and Loathing in Las Vegas). 

It is sensitive to the exact form of root_url (scheme, "www." prefix, trailing slash). 


''' 
import aiohttp 
import asyncio 
import async_timeout 
from urllib.parse import urljoin, urldefrag 

#calculate execution time. Hours... 
import time 
start = time.time() 

antal_arbetare = 10  # number of parallel worker tasks (original was 3) 

#PUT IN THE SITE TO CRAWL HERE 

root_url = "http://www.bbc.com" 

crawled_urls, url_hub = [], [root_url, "%s/sitemap.xml" % (root_url), "%s/robots.txt" % (root_url)] 
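# The seed list is the root URL plus its sitemap and robots.txt. 
# Note: link filtering below does a plain substring match against root_url, so 
# the exact form of root_url (scheme, "www." prefix, trailing slash) matters; 
# see the docstring above. 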

#Choose a user-agent (browser identity) to present to the server. 
#headers = {'user-agent': 'Opera/9.80 (X11; Linux x86_64; U; en) Presto/2.2.15 Version/10.10'} 
#headers = {'user-agent': 'Firefox/53 (X11; Linux x86_64; U; en) sinGUlaro/5.2.12 Ver/2.8'} 
headers = {'user-agent': 'Chrome (Windows 7; B; br) Presto/2.2.15 Version/10.10'} 
#headers = {'user-agent': 'Chromium112AD (Windows 10; CZ) Xenialtrouble-0.2.0'} 
#headers = {'user-agent': 'Firefox12 (Win XP; build-1076-1) tubular bells'} 
#headers = {ADD YOUR OWN} 
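# Note: the user-agent strings above are improvised examples; some servers 
# reject unknown agents, so a real browser string or an honest bot identifier 
# may work better. 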

async def get_body(url): 
    """Fetch a URL and return its HTML, or the error that occurred.""" 
    async with aiohttp.ClientSession() as session: 
        try: 
            # Note: newer versions of async_timeout require "async with" here. 
            with async_timeout.timeout(5): 
                async with session.get(url, headers=headers) as response: 
                    if response.status == 200: 
                        html = await response.text() 
                        return {'error': '', 'html': html} 
                    else: 
                        return {'error': response.status, 'html': ''} 
        except Exception as err: 
            return {'error': err, 'html': ''} 


# Hosts and schemes that are never worth recording as external links. 
BLOCKLIST = ('adserver', 'facebook', 'tel:', 'file:///', 'googleapis', 
             'javascript', 'yimg.com', 'btrll.com', 'flickr.com', 'tv.nu', 
             'klart.se', 'twitter.com', 'linkedin.com', 'facebook.com', 
             'instagram.com', 'mailto:') 


async def handle_task(task_id, work_queue): 
    while not work_queue.empty(): 
        queue_url = await work_queue.get() 
        if queue_url not in crawled_urls: 
            with open('crawledsites.txt', 'a') as a: 
                a.write(queue_url + '\n') 
            crawled_urls.append(queue_url) 
            body = await get_body(queue_url) 
            if not body['error']: 
                for new_url in get_urls(body['html']): 
                    if root_url in new_url and new_url not in crawled_urls: 
                        work_queue.put_nowait(new_url) 
                    # The link points somewhere else: sort it into the right file. 
                    elif root_url.split("//")[-1].split("/")[0].split('www.')[-1] not in new_url: 
                        if 'file:///' in new_url: 
                            with open('localfiles.txt', 'a') as wt: 
                                wt.write(new_url + '\n') 
                        elif not any(token in new_url for token in BLOCKLIST): 
                            with open('externaurl.txt', 'a') as ct: 
                                ct.write(new_url + '\n')  # end of the external-link filter 
            else: 
                # The request failed or returned a non-200 status. 
                with open('notapprovedlinks.txt', 'a') as erroro: 
                    erroro.write(queue_url + '\n') 
                # https://stackoverflow.com/questions/19457227/how-to-print-like-printf-in-python3 



def remove_fragment(url): 
    pure_url, frag = urldefrag(url) 
    return pure_url 

def get_urls(html): 
    new_urls = [url.split('"')[0] for url in str(html).replace("'",'"').split('href="')[1:]] 
    return [urljoin(root_url, remove_fragment(new_url)) for new_url in new_urls] 
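
# Alternative sketch (not wired into the program): the string-splitting in 
# get_urls() above is fragile around unquoted or single-quoted attributes. A 
# stdlib HTMLParser-based variant like the one below could be swapped in; 
# get_urls_parsed and _LinkCollector are hypothetical names, not part of the 
# original program. 
from html.parser import HTMLParser 

class _LinkCollector(HTMLParser): 
    def __init__(self): 
        super().__init__() 
        self.links = [] 

    def handle_starttag(self, tag, attrs): 
        # Collect the href attribute of every <a> tag. 
        if tag == 'a': 
            for name, value in attrs: 
                if name == 'href' and value: 
                    self.links.append(value) 

def get_urls_parsed(html): 
    parser = _LinkCollector() 
    parser.feed(str(html)) 
    return [urljoin(root_url, remove_fragment(u)) for u in parser.links] 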

if __name__ == "__main__": 
    q = asyncio.Queue() 
    for url in url_hub: 
        q.put_nowait(url) 
    loop = asyncio.get_event_loop() 
    tasks = [handle_task(task_id, q) for task_id in range(antal_arbetare)] 
    loop.run_until_complete(asyncio.wait(tasks)) 
    loop.close() 


#print and log the elapsed time 
end = time.time() 
langd = end - start 
with open('logg.txt', 'a') as logg: 
    logg.write(str(root_url) + ' - ' + str(langd) + ' sek - ' + str(len(crawled_urls)) + '\n') 
print(str(root_url) + ' ' + str(len(crawled_urls))) 

################################################################# 
#   AFTERBIRTH - The miracle of birth     # 
################################################################# 
lines_seen = set()  # holds lines already seen 
with open('externaurl.txt', 'r') as infile: 
    for line in infile: 
        if line not in lines_seen:  # not a duplicate 
            lines_seen.add(line) 
with open('externa-sorterade.txt', 'w') as outfile: 
    outfile.writelines(sorted(lines_seen)) 


# Back up the deduplicated external links (filename: year_day_minute). 
backup_name = (str(time.localtime()[0]) + '_' + str(time.localtime()[2]) + '_' 
               + str(time.localtime()[4]) + '_externbackupp.txt') 
with open('externa-sorterade.txt', 'r') as infile, open(backup_name, 'w') as outfile: 
    for line in infile: 
        outfile.write(line) 

# Make a new external-links file with the updated (deduplicated) lines. 
with open('externa-sorterade.txt', 'r') as infile, open('externaurl.txt', 'w') as outfile: 
    for line in infile: 
        outfile.write(line) 

########################################## 

lines_seen = set()  # holds lines already seen 
with open('crawledsites.txt', 'r') as infile: 
    for line in infile: 
        if line not in lines_seen:  # not a duplicate 
            lines_seen.add(line) 
with open('crawledsites-sorterad.txt', 'w') as outfile: 
    outfile.writelines(sorted(lines_seen)) 


# Back up the deduplicated crawl list (filename: year_day_minute). 
backup_name = (str(time.localtime()[0]) + '_' + str(time.localtime()[2]) + '_' 
               + str(time.localtime()[4]) + '_kravlarbackupp.txt') 
with open('crawledsites-sorterad.txt', 'r') as infile, open(backup_name, 'w') as outfile: 
    for line in infile: 
        outfile.write(line) 


# Rewrite the crawl list with the deduplicated lines for the next run. 
with open('crawledsites-sorterad.txt', 'r') as infile, open('crawledsites.txt', 'w') as outfile: 
    for line in infile: 
        outfile.write(line) 
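
The docstring mentions writing a separate program that reads the list this crawler generates and extracts the relevant content. Below is a minimal sketch of such a reader, assuming the crawler has already produced crawledsites.txt in the same directory. The output file titles.txt, the user-agent string and the title-only extraction are illustrative choices, not part of the original program.

#!/usr/bin/env python3 
# Hedged sketch: read the crawler's URL list and extract one piece of content 
# per page (here just the <title>), writing results to titles.txt. 

import re 
import urllib.request 

HEADERS = {'User-Agent': 'content-extractor-sketch/0.1'}  # illustrative identifier 

def extract_title(html): 
    # A very small example of "relevant content": the page title. 
    match = re.search(r'<title[^>]*>(.*?)</title>', html, re.IGNORECASE | re.DOTALL) 
    return match.group(1).strip() if match else '' 

def main(): 
    with open('crawledsites.txt') as infile, open('titles.txt', 'w') as outfile: 
        for line in infile: 
            url = line.strip() 
            if not url: 
                continue 
            try: 
                request = urllib.request.Request(url, headers=HEADERS) 
                with urllib.request.urlopen(request, timeout=5) as response: 
                    html = response.read().decode('utf-8', errors='replace') 
                outfile.write(url + '\t' + extract_title(html) + '\n') 
            except Exception as err: 
                outfile.write(url + '\t' + 'ERROR: ' + str(err) + '\n') 

if __name__ == '__main__': 
    main() 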

What is your question? –

Answers

… has some ideas for this.
