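I have a huge text file. I want to remove all line breaks, including the breaks between paragraphs, so that each paragraph is appended to the previous paragraph. How can I do this in Java? I am using replaceAll(), but the paragraphs are not being appended to the previous paragraph. How can I remove all line breaks and paragraph breaks from a given text file using Java?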
// Assumes `word` (Text) and `one` (IntWritable) are fields of the enclosing Mapper class.
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    StringBuilder sb = new StringBuilder();
    // Replace tabs and newlines with a space; replacing them with "" would glue neighbouring words together.
    String cleaned = value.toString().replaceAll("[\\t\\n]+", " ");
    System.out.println(cleaned);
    StringTokenizer itr = new StringTokenizer(cleaned);
    // Copy the tokens into an array sized to the exact token count.
    String[] tokens = new String[itr.countTokens()];
    for (int l = 0; l < tokens.length; l++) {
        tokens[l] = itr.nextToken();
    }
    // Emit each token followed by up to four successors (a window of at most five words).
    for (int i = 0; i < tokens.length; i++) {
        sb.append(tokens[i]);
        // Bound j by tokens.length so the window cannot run past the end of the array.
        for (int j = i + 1; j < i + 5 && j < tokens.length; j++) {
            sb.append(" ");
            sb.append(tokens[j]);
        }
        word.set(sb.toString());
        context.write(word, one);
        //System.out.println(sb.toString());
        sb.setLength(0);
    }
}
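One possible reason replaceAll() appears to do nothing in the mapper above: if the job uses Hadoop's default TextInputFormat, each map() call receives a single line with its terminator already stripped, so there is no "\n" left to replace. Joining lines has to happen across records, not inside one. Outside Hadoop, the same idea can be sketched in plain Java by reading line by line and joining with a space. The class and method names here (`LineJoiner`, `joinLines`) are made up for illustration, and blank lines are simply skipped, which is an assumption about what "remove paragraph breaks" means.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.StringJoiner;

class LineJoiner {
    // Reads the source line by line and joins all non-blank lines with one space.
    // Hypothetical helper for illustration; not part of the original post.
    static String joinLines(Reader source) {
        StringJoiner joined = new StringJoiner(" ");
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {   // blank lines (paragraph breaks) are dropped
                    joined.add(trimmed);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        String sample = "by\nWilliam Shakespeare\n\nThis eBook is for the use\nof anyone\n";
        // prints: by William Shakespeare This eBook is for the use of anyone
        System.out.println(joinLines(new StringReader(sample)));
    }
}
```

For a real file, a reader from `Files.newBufferedReader(path)` could be passed in instead of the `StringReader`. Note that the result is still held in memory as one string, so for a truly huge file it may be better to write each line to an output stream as it is read.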
Input:
The Project Gutenberg EBook of The Complete Works of William Shakespeare, by
William Shakespeare
sn
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org
** This is a COPYRIGHTED Project Gutenberg eBook, Details Below **
** Please follow the copyright guidelines in this file. **
Title: The Complete Works of William Shakespeare
Author: William Shakespeare
Posting Date: September 1, 2011 [EBook #100]
Release Date: January, 1994
Language: English
*** START OF THIS PROJECT GUTENBERG EBOOK COMPLETE WORKS--WILLIAM SHAKESPEARE ***
Produced by World Library, Inc., from their Library of the Future
This is the 100th Etext file presented by Project Gutenberg, and
is presented in cooperation with World Library, Inc., from their
Library of the Future and Shakespeare CDROMS. Project Gutenberg
often releases Etexts that are NOT placed in the Public Domain!!
Shakespeare
*This Etext has certain copyright implications you should read!*
Expected output:
The Project Gutenberg EBook of The Complete Works of William Shakespeare, by
William Shakespeare sn This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org ** This is a COPYRIGHTED Project Gutenberg eBook, Details Below Please follow the copyright guidelines in this file.Title: The Complete Works of William Shakespeare Author: William Shakespeare Posting Date: September 1, 2011 [EBook #100]
Release Date: January, 1994 Language: English START OF THIS PROJECT GUTENBERG EBOOK COMPLETE WORKS--WILLIAM SHAKESPEARE Produced by World Library, Inc., from their Library of the Future This is the 100th Etext file presented by Project Gutenberg, and is presented in cooperation with World Library, Inc., from their Library of the Future and Shakespeare CDROMS. Project Gutenberg often releases Etexts that are NOT placed in the Public Domain!! Shakespeare *This Etext has certain copyright implications you should read!*
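Please post an example showing the input and what it should change to. Also, please don't post text or code as images or links ([more info](https://meta.stackoverflow.com/a/285557)). Use the [edit] option to correct your post. – Pshemo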
@Pshemo I need to remove all line breaks and punctuation, and append each paragraph to the previous paragraph, so that everything ends up as a single paragraph. –
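It is still not very clear. You ask to remove "all line breaks", which would mean ending up with a single line. That is not the case here, because your expected output has 4 lines. Which line separators should stay, and how do you recognize them? You also write that all punctuation must be removed, yet the expected output still contains ":" and "," in "Release Date: January, 1994", not to mention the "," before "by William Shakespeare". – Pshemo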
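To make the two readings in this comment concrete, here is a minimal sketch of one of them: keep one line per paragraph, join the lines inside each paragraph with a space, and strip punctuation as a separate step. It assumes that paragraphs are separated by blank lines, which the post never confirms, and the names `ParagraphJoiner`, `joinWithinParagraphs`, and `stripPunctuation` are hypothetical.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class ParagraphJoiner {
    // Joins the lines inside each paragraph with a space and keeps one line per paragraph.
    // Assumption: a paragraph break is one or more blank (or whitespace-only) lines.
    static String joinWithinParagraphs(String text) {
        // Normalize CRLF and lone CR to LF so the patterns below only deal with "\n".
        String normalized = text.replaceAll("\\r\\n?", "\n");
        return Arrays.stream(normalized.split("\\n(?:[ \\t]*\\n)+"))
                .map(p -> p.replaceAll("[ \\t]*\\n[ \\t]*", " ").trim())
                .filter(p -> !p.isEmpty())
                .collect(Collectors.joining("\n"));
    }

    // Removes ASCII punctuation such as , . : ; ! * [ ] and #.
    static String stripPunctuation(String text) {
        return text.replaceAll("\\p{Punct}", "");
    }

    public static void main(String[] args) {
        String sample = "Title: A\nAuthor: B\n\nfirst line\nsecond line\n";
        System.out.println(joinWithinParagraphs(sample));
        System.out.println(stripPunctuation("Release Date: January, 1994"));
    }
}
```

If the goal really is a single line, replacing the `"\n"` passed to `Collectors.joining` with `" "` gives that instead. Keeping the two steps separate makes it easy to decide later whether punctuation should be removed at all.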