Univocity: handling "","-" in CSV

How can I get the proper rows? Some lines are being glued together, and I can't work out why or how to stop it.

    col. 0: Date
    col. 1: Col2 
    col. 2: Col3 
    col. 3: Col4 
    col. 4: Col5 
    col. 5: Col6 
    col. 6: Col7 
    col. 7: Col7 
    col. 8: Col8 

    col. 0: 2017-05-23 
    col. 1: String 
    col. 2: lo rem ipsum 
    col. 3: dolor sit amet 
    col. 4: mcdonalds.com/online.html 
    col. 5: null 
    col. 6: "","-""-""2017-05-23" 
    col. 7: String 
    col. 8: lo rem ipsum 
    col. 9: dolor sit amet 
    col. 10: burgerking.com 
    col. 11: https://burgerking.com/ 
    col. 12: 20 
    col. 13: 2 
    col. 14: fake 

    col. 0: 2017-05-23 
    col. 1: String 
    col. 2: lo rem ipsum 
    col. 3: dolor sit amet 
    col. 4: wendys.com 
    col. 5: null 
    col. 6: "","-""-""2017-05-23" 
    col. 7: String 
    col. 8: lo rem ipsum 
    col. 9: dolor sit amet 
    col. 10: buggagump.com 
    col. 11: null 
    col. 12: "","-""-""2017-05-23" 
    col. 13: String 
    col. 14: cheese 
    col. 15: ad eum 
    col. 16: mcdonalds.com/online.html 
    col. 17: null 
    col. 18: "","-""-""2017-05-23" 
    col. 19: String 
    col. 20: burger 
    col. 21: ludus dissentiet 
    col. 22: www.mcdonalds.com 
    col. 23: https://www.mcdonalds.com/ 
    col. 24: 25 
    col. 25: 3 
    col. 26: fake 

    col. 0: 2017-05-23 
    col. 1: String 
    col. 2: wine 
    col. 3: id erat utamur 
    col. 4: bubbagump.com 
    col. 5: https://buggagump.com/ 
    col. 6: 25 
    col. 7: 3 
    col. 8: fake 
    done

Sample CSV (the \r\n line endings may have been corrupted by copy/paste). Available here: https://www.dropbox.com/s/86klza4qok4ty2s/malformed%20csv%20r%20n%20small.csv?dl=0

"Date","Col2","Col3","Col4","Col5","Col6","Col7","Col7","Col8" 
"2017-05-23","String","lo rem ipsum","dolor sit amet","mcdonalds.com/online.html","","-","-","-" 
"2017-05-23","String","lo rem ipsum","dolor sit amet","burgerking.com","https://burgerking.com/","20","2","fake" 
"2017-05-23","String","lo rem ipsum","dolor sit amet","wendys.com","","-","-","-" 
"2017-05-23","String","lo rem ipsum","dolor sit amet","buggagump.com","","-","-","-" 
"2017-05-23","String","cheese","ad eum","mcdonalds.com/online.html","","-","-","-" 
"2017-05-23","String","burger","ludus dissentiet","www.mcdonalds.com","https://www.mcdonalds.com/","25","3","fake" 
"2017-05-23","String","wine","id erat utamur","bubbagump.com","https://buggagump.com/","25","3","fake"

Parser settings:

    CsvParserSettings settings = new CsvParserSettings();

    settings.setDelimiterDetectionEnabled(true);
    settings.setQuoteDetectionEnabled(true);

    settings.setLineSeparatorDetectionEnabled(false); // all the same using `true`
    settings.getFormat().setLineSeparator("\r\n");

    CsvParser parser = new CsvParser(settings);

    List<String[]> rows;

    rows = parser.parseAll(getReader("testFiles/" + "malformed csv small.csv"));

    for (String[] row : rows)
    {
        System.out.println("");
        int i = 0;

        for (String element : row)
        {
            System.out.println("col. " + i++ + ": " + element);
        }
    }

    System.out.println("done");

Source

2017-05-26 Buffalo

I don't think it's related to the line breaks. Check your quote settings: see CsvFormat (http://docs.univocity.com/parsers/2.1.0/com/univocity/parsers/csv/CsvFormat.html). It looks like "" is being interpreted as quoted text. – TmTron

Your parser really doesn't seem to like "". – pvg

@pvg, this is related to the auto-detection process. See my answer below.

Since you are testing the auto-detection process, I suggest you print out the detected format:

CsvFormat format = parser.getDetectedFormat(); 
System.out.println(format);

This will print:

CsvFormat: 
    Comment character=# 
    Field delimiter=, 
    Line separator (normalized)=\n 
    Line separator sequence=\r\n 
    Quote character=" 
    Quote escape character=- 
    Quote escape escape character=null

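As you can see, the parser is not detecting the quote escape correctly. The format detection process is usually very good, but with small test samples there is no guarantee it will always get things right. With your sample I can't see why it would pick up - as the escape character, so I've opened an issue to investigate what makes it detect that.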
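To illustrate why a wrongly detected quote escape can glue lines together, here is a minimal sketch of a toy CSV reader. It is not univocity's algorithm and will not reproduce its exact output; the class name, the method, and the shortened two-record sample are all made up for illustration. The only point is that once - counts as the quote escape, the sequence -" no longer closes a field, so quotes get paired up differently and a line break can end up inside a quoted value.

```java
import java.util.ArrayList;
import java.util.List;

// Toy CSV reader (hypothetical, NOT univocity's implementation). It only shows how the
// choice of quote-escape character changes where quoted fields, and thus rows, end.
final class QuoteEscapeDemo {

    // Two records shaped like the sample file, each terminated by \r\n.
    static final String SAMPLE =
            "\"x\",\"\",\"-\",\"-\",\"-\"\r\n"
          + "\"y\",\"z\",\"1\",\"2\",\"3\"\r\n";

    static List<List<String>> parse(String text, char quote, char escape) {
        List<List<String>> rows = new ArrayList<>();
        List<String> row = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean nextIsQuote = i + 1 < text.length() && text.charAt(i + 1) == quote;

            if (inQuotes) {
                if (c == escape && nextIsQuote) {
                    field.append(quote); // escaped quote: keep it and stay inside the field
                    i++;
                } else if (c == quote) {
                    inQuotes = false;    // closing quote
                } else {
                    field.append(c);     // includes ',' and line breaks while quoted
                }
            } else if (c == quote) {
                inQuotes = true;         // opening quote
            } else if (c == ',') {
                row.add(field.toString());
                field.setLength(0);
            } else if (c == '\n') {      // end of record
                row.add(field.toString());
                field.setLength(0);
                rows.add(row);
                row = new ArrayList<>();
            } else if (c != '\r') {
                field.append(c);
            }
        }
        if (field.length() > 0 || !row.isEmpty()) { // flush an unterminated last record
            row.add(field.toString());
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(parse(SAMPLE, '"', '"').size()); // 2: the records stay apart
        System.out.println(parse(SAMPLE, '"', '-').size()); // 1: the records are glued
    }
}
```

With " as the escape, each "-" is an ordinary quoted field. With - as the escape, the first "-" swallows its own closing quote, and from there on every quote is matched with the wrong partner until the end of the input.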

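If you know for a fact that none of your input files will ever use - as the quote escape, you can detect the format, test what it picked up from the input, and then parse the contents, like this: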

// imports needed by the method below
import com.univocity.parsers.csv.CsvFormat;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import java.io.File;
import java.util.List;

public List<String[]> parse(File input, CsvFormat format) {
    CsvParserSettings settings = new CsvParserSettings();
    if (format == null) { //no format specified? Let's detect what we are dealing with
        settings.detectFormatAutomatically();

        CsvParser parser = new CsvParser(settings);
        parser.beginParsing(input); //just call beginParsing to kick off the auto-detection process
        format = parser.getDetectedFormat(); //capture the format
        parser.stopParsing(); //stop the parser - no need to read anything yet.

        System.out.println(format);

        if (format.getQuoteEscape() == '-') { //got something weird detected? Let's amend it.
            format.setQuoteEscape('"');
        }

        return parse(input, format); //now parse with the intended format
    } else {
        settings.setFormat(format); //this parses with the format adjusted earlier.
        CsvParser parser = new CsvParser(settings);
        return parser.parseAll(input);
    }
}

Now just call the parse method:

List<String[]> rows = parse(new File("/Users/jbax/Downloads/malformed csv r n small.csv"), null);

and it will extract your data properly. Hope this helps!
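As a quick sanity check that rows are no longer glued together, one option is to compare each row's column count with the header's. The sketch below is only an illustration: the class and method names are hypothetical and not part of univocity, and it assumes the first row is the header. The widths used in main (9, 15, 27, 9) are the ones from the output in the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of univocity: flags rows whose width differs from the header.
final class RowWidthCheck {

    // Returns the indexes of rows whose column count differs from row 0 (assumed to be the header).
    static List<Integer> rowsWithUnexpectedWidth(List<String[]> rows) {
        List<Integer> suspicious = new ArrayList<>();
        if (rows.isEmpty()) {
            return suspicious;
        }
        int expected = rows.get(0).length;
        for (int i = 1; i < rows.size(); i++) {
            if (rows.get(i).length != expected) {
                suspicious.add(i);
            }
        }
        return suspicious;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[9]);  // header: 9 columns
        rows.add(new String[15]); // two records glued together
        rows.add(new String[27]); // four records glued together
        rows.add(new String[9]);  // a correctly parsed record
        System.out.println(rowsWithUnexpectedWidth(rows)); // [1, 2]
    }
}
```

An empty result for the list returned by parse(...) would suggest that every record was split where expected.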

Source

2017-05-29 00:40:33

I was focusing on the lines not being split properly and missed what was going on with the quote escape. All good now. Thank you! – Buffalo
