Parsing HTML with Java (jsoup)

I ran into a problem while parsing an HTML document with jsoup (Java). The HTML I am parsing has this form:

..... 
<hr> 
    <a name="N1"> </a> Text 1<br> 
<hr> 
    <a name="N2"> </a> Text 2<br> 
<hr> 
    <a name="N3"> </a>Text 3<br> 
<hr> 
    <a name="N4"> </a> 
    <DIV style="margin-left: 36px"> 
    <div></div> 
    <img src=bullet.gif alt="Bullet point"> Text 
    </DIV><br> 
<hr> 
<a name="X5"> </a> 
<DIV style="margin-left: 36px"> 
    <div></div> 
    <img src=bullet.gif alt="Bullet point"> Text 
</DIV><br> 
<hr> 
    ...

I want to separate out the HTML text between each pair of <hr> tags. I tried this code:

File input = new File("C:\\Users\\page.html"); 
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/"); 
Elements body = doc.select("body"); 
Elements hrs = body.select("hr"); 
ArrayList<String> objects = new ArrayList<String>(); 
for (Element hr : hrs) { 
    String textAfterHr = hr.nextSibling().toString(); 
    objects.add(textAfterHr); 
}

System.out.println(objects);

The ArrayList does not contain what I want, and I do not know how to fix it. (Is it possible to convert the <hr> tags into <hr> text </hr> pairs?)


2017-07-19 HappyDAD

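What does the ArrayList contain? What is the expected output?

Are you interested only in the text placed right after an <hr>, or in the full text between two <hr> tags? – Pshemo

The ArrayList should contain all the text between two <hr> tags. @Pshemo I am parsing it to get the <a> or div elements. – HappyDAD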

Here the result is obtained by reading the children of each hr tag. Use this as a starting point for a better solution.

// body is the Elements object selected in the question: doc.select("body")
ArrayList<String> objects = new ArrayList<String>();
Elements hrs = body.select("hr");
for (Element hr : hrs) {
    // Collect the text of every child element of this hr
    for (Element child : hr.children()) {
        objects.add(child.text());
    }
}
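
One point worth keeping in mind with the snippet above: jsoup's HTML parser treats <hr> as a void element, so in a document parsed that way each hr has no children and the content ends up in the siblings that follow it. Below is a minimal sketch of walking those siblings up to the next hr. It assumes jsoup is on the classpath; the class name HrSections, the helper sectionsAfterEachHr and the shortened sample string are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

public class HrSections {

    // Hypothetical helper: returns the HTML found after each <hr>,
    // up to (but not including) the next <hr> under the same parent.
    static List<String> sectionsAfterEachHr(String html) {
        Document doc = Jsoup.parse(html);
        List<String> sections = new ArrayList<>();
        for (Element hr : doc.select("hr")) {
            StringBuilder section = new StringBuilder();
            Node node = hr.nextSibling();
            // Stop at the end of the parent or at the next <hr>.
            while (node != null && !"hr".equals(node.nodeName())) {
                section.append(node.outerHtml());
                node = node.nextSibling();
            }
            sections.add(section.toString().trim());
        }
        return sections;
    }

    public static void main(String[] args) {
        // Shortened version of the question's sample (an assumption for the demo).
        String html = "<hr> <a name=\"N1\"> </a> Text 1<br>"
                + "<hr> <a name=\"N2\"> </a> Text 2<br>"
                + "<hr>";
        for (String section : sectionsAfterEachHr(html)) {
            System.out.println(section);
            System.out.println("-----");
        }
    }
}
```

This only separates the sections correctly when all hr elements share the same parent, as they do in the question's sample. The last hr has nothing after it, so its entry is an empty string.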


2017-07-20 06:02:41

public static void main(String[] args) {
    String html = ".....\n" + 
        "<hr>\n" + 
        " <a name=\"N1\"> </a> Text 1<br>\n" + 
        "<hr>\n" + 
        " <a name=\"N2\"> </a> Text 2<br>\n" + 
        "<hr>\n" + 
        " <a name=\"N3\"> </a>Text 3<br>\n" + 
        "<hr>\n" + 
        " <a name=\"N4\"> </a>\n" + 
        " <DIV style=\"margin-left: 36px\">\n" + 
        " <div></div>\n" + 
        " <img src=bullet.gif alt=\"Bullet point\"> Text\n" + 
        " </DIV><br>\n" + 
        "<hr>\n" + 
        " <a name=\"X5\"> </a>\n" + 
        " <DIV style=\"margin-left: 36px\">\n" + 
        " <div></div>\n" + 
        " <img src=bullet.gif alt=\"Bullet point\"> Text\n" + 
        " </DIV><br>\n" + 
        "<hr>\n" + 
        " ..."; 
    // Split the HTML string before each <hr> tag; the lookahead keeps the delimiter
    String[] parts = html.split("(?=<hr>)");
    // Join the pieces back into one string, using a closing </hr> tag as the separator
    html = String.join("</hr>\n", parts);
    // Parse with jsoup's XML parser (org.jsoup.parser.Parser) so that <hr> can hold content
    Document doc = Jsoup.parse(html, "", Parser.xmlParser());
    Elements eles = doc.select("hr");
    for (Element e : eles) {
        System.out.println(e.html());
        System.out.println("-----------------------");
    }
}
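
To see what the split-and-join step above produces before any parsing happens, here is a small sketch that uses only the standard library. The class name HrSplitDemo, the helper closeHrTags and the shortened sample string are assumptions for illustration and are not part of the original answer.

```java
public class HrSplitDemo {

    // Hypothetical helper mirroring the preprocessing step:
    // split before every "<hr>" (the lookahead keeps the tag itself),
    // then glue the pieces back together with a closing "</hr>".
    static String closeHrTags(String html) {
        String[] parts = html.split("(?=<hr>)");
        return String.join("</hr>\n", parts);
    }

    public static void main(String[] args) {
        // Shortened sample input (an assumption for the demo).
        String html = "<hr>\n Text 1<br>\n<hr>\n Text 2<br>\n";
        System.out.println(closeHrTags(html));
    }
}
```

Because the closing tag is only used as a separator, the last <hr> is left unclosed by this step. The pattern is also a literal, case-sensitive match, so it does not catch variants such as <HR> or <hr/>.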


2017-07-20 10:43:46 Eritrean
