Crawling a website to extract email addresses

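I have a web crawler that does not work correctly. When it visits a page such as http://www.canon.de/support/consumer_products/contact_support/, it should extract the email addresses from that page. In addition, if the page links to other Canon pages, the crawler should visit all of those pages and collect the email addresses from them as well.

Unfortunately, my method searchForWord does not work: the if statement is never reached, and I don't know why. Where is my mistake?

Here are my classes: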

Spider.class

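import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

public class Spider {

    private static final int MAX_PAGES_TO_SEARCH = 10;
    private Set<String> pagesVisited = new HashSet<String>();
    private List<String> pagesToVisit = new LinkedList<String>();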



    /** 
    * Our main launching point for the Spider's functionality. Internally it creates spider legs 
    * that make an HTTP request and parse the response (the web page). 
    * 
    * @param url 
    *   - The starting point of the spider 
    */ 
    public void search(String url) 
    { 
     while(this.pagesVisited.size() < MAX_PAGES_TO_SEARCH) 
     { 
      String currentUrl; 
      SpiderLeg leg = new SpiderLeg(); 
      if(this.pagesToVisit.isEmpty()) 
      { 
       currentUrl = url; 
       this.pagesVisited.add(url); 
      } 
      else 
      { 
       currentUrl = this.nextUrl(); 
      } 
      leg.crawl(currentUrl); // Lots of stuff happening here. Look at the crawl method in 
            // SpiderLeg 


      leg.searchForWord(currentUrl); 



      this.pagesToVisit.addAll(leg.getLinks()); 
     } 
     System.out.println("\n**Done** Visited " + this.pagesVisited.size() + " web page(s)"); 
    } 


    /** 
    * Returns the next URL to visit (in the order that they were found). We also do a check to make 
    * sure this method doesn't return a URL that has already been visited. 
    * 
    * @return 
    */ 
    private String nextUrl() 
    { 
     String nextUrl; 
     do 
     { 
      nextUrl = this.pagesToVisit.remove(0); 
     } while(this.pagesVisited.contains(nextUrl)); 
     this.pagesVisited.add(nextUrl); 
     return nextUrl; 
    } 


    }

SpiderLeg.class

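import java.io.IOException;
import java.util.LinkedList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SpiderLeg 
{ 
// We'll use a fake USER_AGENT so the web server thinks the robot is a normal web browser. 
private static final String USER_AGENT = 
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1"; 
private List<String> links = new LinkedList<String>(); 
private Document htmlDocument; 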


/** 
* This performs all the work. It makes an HTTP request, checks the response, and then gathers 
* up all the links on the page. Perform a searchForWord after the successful crawl 
* 
* @param url 
*   - The URL to visit 
* @return whether or not the crawl was successful 
*/ 
public boolean crawl(String url) 
{ 
    try 
    { 
     Connection connection = Jsoup.connect(url).userAgent(USER_AGENT); 
     Document htmlDocument = connection.get(); 
     this.htmlDocument = htmlDocument; 
     if(connection.response().statusCode() == 200) // 200 is the HTTP OK status code 
                 // indicating that everything is great. 
     { 
      System.out.println("\n**Visiting** Received web page at " + url); 
     } 
     if(!connection.response().contentType().contains("text/html")) 
     { 
      System.out.println("**Failure** Retrieved something other than HTML"); 
      return false; 
     } 
     Elements linksOnPage = htmlDocument.select("a[href]"); 
     //System.out.println("Found (" + linksOnPage.size() + ") links"); 
     for(Element link : linksOnPage) 
     { 
      this.links.add(link.absUrl("href")); 
     } 
     return true; 
    } 
    catch(IOException ioe) 
    { 
     // We were not successful in our HTTP request 
     return false; 
    } 
} 


public void searchForWord(String searchWord) 
{ 


    Pattern pattern = 
       Pattern.compile("([\\w\\-]([\\.\\w])+[\\w]+@([\\w\\-]+\\.)+[A-Za-z]{2,4})"); 

       Matcher matchs = pattern.matcher(searchWord); 

       if (matchs.find()) { 

         System.out.println(searchWord.substring(matchs.start(), matchs.end())); 


       } 

       else 
        System.out.println("hdhdadsad"); 


} 


public List<String> getLinks() 
{ 
    return this.links; 
} 

}

SpiderTest.class

public class SpiderTest 
{ 

    public static void main(String[] args) 
    { 
     Spider spider = new Spider(); 
     spider.search("http://www.canon.de/support/consumer_products/contact_support/"); 
    } 
}

Source

2017-06-08 Blnpwr

Can you define "doesn't work"? –

The if statement is never true. I don't know why. – Blnpwr

I would rewrite your question as "why is my regex not working" and show the string you are trying to match. All the networking stuff is irrelevant. –

Your regex is valid; see http://www.regexpal.com/?fam=97822

To find all matches, use the code below instead of the if:

while (matchs.find()) { 
    System.out.println(matchs.group()); 
}
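
The loop above prints every match rather than only the first one. As a self-contained sketch of that idea, the following collects all matches into a list. The class name `EmailExtractor`, the method `extractEmails`, the sample text, and the simplified pattern are assumptions made for illustration; they are not taken from the question's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmailExtractor {

    // Simplified, illustrative pattern (an assumption, not the question's pattern).
    private static final Pattern EMAIL =
            Pattern.compile("[\\w.\\-]+@(?:[\\w\\-]+\\.)+[A-Za-z]{2,}");

    // Returns every substring of text that matches the pattern, in order of appearance.
    static List<String> extractEmails(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = EMAIL.matcher(text);
        while (matcher.find()) {          // find() advances to the next match on each call
            found.add(matcher.group());   // group() returns the text of the current match
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, standing in for the text of a crawled page.
        String sample = "Contact support@example.com or sales@example.org for help.";
        System.out.println(extractEmails(sample));
    }
}
```

With the sample string this prints `[support@example.com, sales@example.org]`. Whatever pattern is used, the result depends on which string is handed to `matcher(...)`.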

Source

2017-06-08 15:35:56

No, the OP is passing the wrong thing to the regex. – Taylor

The regex is valid, I tested it –

In general the regex is valid, but I still can't reach the if statement. – Blnpwr
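
Taylor's comment can be checked without any networking. In `Spider.search`, the call `leg.searchForWord(currentUrl)` hands the URL string to the matcher, and that URL contains no `@`, so `find()` returns false. A minimal sketch, assuming a simplified email pattern and a hypothetical `pageText` string that stands in for the text of the downloaded page (in `SpiderLeg` this would presumably be something like `htmlDocument.body().text()`):

```java
import java.util.regex.Pattern;

final class UrlVersusPageText {

    // Simplified, illustrative pattern (an assumption, not the question's pattern).
    private static final Pattern EMAIL =
            Pattern.compile("[\\w.\\-]+@(?:[\\w\\-]+\\.)+[A-Za-z]{2,}");

    // True if the input contains at least one substring matching the pattern.
    static boolean containsEmail(String input) {
        return EMAIL.matcher(input).find();
    }

    public static void main(String[] args) {
        // The string the question's code actually passes to searchForWord.
        String url = "http://www.canon.de/support/consumer_products/contact_support/";
        // Hypothetical page text; a real crawler would take this from the parsed document.
        String pageText = "Write to us at support@example.com";

        System.out.println(containsEmail(url));      // false: the URL has no '@'
        System.out.println(containsEmail(pageText)); // true
    }
}
```

This is only an illustration of the diagnosis in the comment; whether the real page exposes its addresses in visible text is not something the thread establishes.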
