Removing comments from large files using Java

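I have files such as .sh, .txt, .sql and .pkb that are larger than 10 MB, which means more than 100k lines each.

I want to remove the comments from these files and then work with the remaining non-comment content. I have written the following code for that.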

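// Imports the snippet needs; FileUtils is assumed to be Apache Commons IO 
import java.io.File; 
import java.io.IOException; 
import java.util.ArrayList; 
import java.util.Arrays; 
import java.util.Collections; 
import java.util.Date; 
import java.util.List; 
import org.apache.commons.io.FileUtils; 

/** 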
* Removes all commented parts from the file content and returns a 
* file structure that contains only the lines with declaration syntax, e.g. 
* Create Package packageName <- Stores all declaration lines as separate 
* strings in an array 
* 
* @param file 
* @return file content 
* @throws IOException 
*/ 
private static String[] filterContent(File file) throws IOException { 

    String withoutComment = ""; 
    String declare = ""; 
    String[] content; 
    List<String> readLines = FileUtils.readLines(file); 

    int size = readLines.size(); 
    System.out.println(file.getName() + " Files number of lines "+ size + " at "+new Date()); 
    String[] declareLines = new String[size]; 
    int startComment = 0; 
    int endComment = 0; 
    Boolean check = false; 
    int j = 0; 
    int i=0; 
    // Reading content line by line 
    for (String line:readLines) { 
     // If line contains */ that means comment is ending in this line, 
     // making a note of the line number 
     if (line.toString().contains("*/")) { 
      endComment = i; 
      // Removing the content before */ from the line 
      int indexOf = line.indexOf("*/"); 
      line = line.replace(line.substring(0, indexOf + 2), ""); 
     } 

     // If startComment is assigned fresh value and end comment hasn't, 
     // that means the current line is part of the comment 
     // Ignoring the line in this case and moving on to the next one 
     if ((startComment > 0 && endComment == 0) || (endComment < startComment) || check) 
      continue; 

     // If line contains /* that means comment is starting in this line, 
     // making a note of the line number 
     if (line.contains("/*")) { 
      startComment = i; 
      // Removing the content after /* from the line 
      int indexOf = line.indexOf("/*"); 
      line = line.replace(line.substring(indexOf), ""); 
      if (i == 0) 
       check = true; // means comment in the very first line 
     } 

     // If line contains -- that means single line comment is present in 
     // this line, 
     // removing the content after -- 
     if (line.contains("--")) { 
      int indexOf = line.indexOf("--"); 
      line = line.replace(line.substring(indexOf), ""); 
     } 
     // If line contains # that means a single line comment is present in 
     // this line, 
     // removing the content after # 
     if (line.contains("#")) { 
      int indexOf = line.indexOf("#"); 
      line = line.replace(line.substring(indexOf), ""); 
     } 

     // At this point, all commented part is removed from the line, hence 
     // appending it to the final content 
     if (!line.isEmpty()) 
      withoutComment = withoutComment + line + " \n"; 
     // If line contains CREATE its a declaration line, holding it 
     // separately in the array 
     if (line.toUpperCase().contains(("CREATE"))) { 
      // If next line does not contains Create and the current line is 
      // the not the last line, 
      // then considering two consecutive lines as declaration line, 
      if (i < size - 1 && !readLines.get(i + 1).toString().toUpperCase().contains(("CREATE"))) { 
       declare = line + " " + readLines.get(i + 1).toString() + "\n"; 
      } else if (i < size) {// If the line is last line, including 
            // that line alone. 
       declare = line + "\n"; 
      } 

      declareLines[j] = declare.toUpperCase(); 
      j++; 
     } 
     i++; 
    } 
    System.out.println("Read lines "+ new Date()); 
    List<String> list = new ArrayList<String>(Arrays.asList(declareLines)); 
    list.removeAll(Collections.singleton(null)); 

    content = list.toArray(new String[list.size() + 1]); 

    withoutComment = withoutComment.toUpperCase(); 
    content[j] = withoutComment; 
    System.out.println("Returning uncommented content "+ new Date()); 
    return content; 
} 


public static void main(String[] args) throws IOException { 
     // filterContent declares IOException, so main has to propagate or catch it 
     String[] content = filterContent(new File("abc.txt")); 
}

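The problem with this code is that it is far too slow when the file is large. For a 10 MB file, removing the comments takes more than 6 hours. (The code was run on a server over SSH.)

The files can be as large as 100 MB, in which case removing the comments could take days. How can I remove the comments faster?

Update: This question is not a duplicate, because my problem is not solved simply by changing how the lines are read. It is the string handling that slows the process down, and I need a way to make the comment removal itself faster.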

Source

2017-02-17 Harshita Sethi

1. Don't load the whole file into memory. 2. Why do you want to do this? – Axel

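First, don't put the lines in a list; read the file with an InputStream and parse each line directly. You can easily check whether a line contains '/*' or '/* ... */'. Remove those parts and write a new file without the comments. Reading a file of 100 MB or more has never taken me that long. – AxelH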

[How to read a large text file line by line using Java?](http://stackoverflow.com/questions/5868369/how-to-read-a-large-text-file-line-by-line-using-java) – AxelH
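The streaming approach suggested in these comments could look roughly like the sketch below. It is an illustration, not code from the thread: the class and method names (`StreamingCommentStripper`, `strip`, `stripToString`) and the UTF-8 encoding are assumptions. It holds one line in memory at a time and tracks block comments with a single boolean flag. It does not understand quoted strings, so a `--` or `#` inside a string literal, or a shell construct such as `$#` or a `#!` shebang line, would also be cut.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class StreamingCommentStripper {

    /** Copies in to out, dropping block comments and "--" / "#" line comments. */
    static void strip(BufferedReader in, Writer out) throws IOException {
        boolean inBlock = false; // true while inside an unterminated block comment
        String line;
        while ((line = in.readLine()) != null) {
            StringBuilder kept = new StringBuilder();
            int pos = 0;
            while (pos < line.length()) {
                if (inBlock) {
                    int end = line.indexOf("*/", pos);
                    if (end < 0) {
                        pos = line.length(); // comment continues on the next line
                    } else {
                        inBlock = false;
                        pos = end + 2;
                    }
                } else {
                    int block = line.indexOf("/*", pos);
                    int lineComment = earliest(line.indexOf("--", pos), line.indexOf('#', pos));
                    if (lineComment >= 0 && (block < 0 || lineComment < block)) {
                        kept.append(line, pos, lineComment); // rest of the line is a comment
                        pos = line.length();
                    } else if (block >= 0) {
                        kept.append(line, pos, block); // keep the text before the block comment
                        inBlock = true;
                        pos = block + 2;
                    } else {
                        kept.append(line, pos, line.length()); // no comment left on this line
                        pos = line.length();
                    }
                }
            }
            // Skip lines that are empty or whitespace-only once comments are removed
            if (kept.toString().trim().length() > 0) {
                out.write(kept.toString());
                out.write('\n');
            }
        }
    }

    /** Returns the smaller non-negative index, or -1 if both are negative. */
    static int earliest(int a, int b) {
        if (a < 0) return b;
        if (b < 0) return a;
        return Math.min(a, b);
    }

    /** Convenience wrapper for small inputs and tests. */
    static String stripToString(String text) throws IOException {
        StringWriter out = new StringWriter();
        strip(new BufferedReader(new StringReader(text)), out);
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 2) {
            // Hypothetical usage: java StreamingCommentStripper input.sql output.sql
            try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8);
                 BufferedWriter out = Files.newBufferedWriter(Paths.get(args[1]), StandardCharsets.UTF_8)) {
                strip(in, out);
            }
        } else {
            System.out.print(stripToString("CREATE/* one */ TABLE t;\n# whole line\n"));
        }
    }
}
```

Because the output goes straight to a Writer, nothing proportional to the file size is accumulated in memory, which is the point of the advice above.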

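The biggest problem in my code was the use of Strings. How the lines are read makes no real difference, but accumulating the uncommented lines in a StringBuilder instead of a String does. With the same code using StringBuilder, removing the comments takes a few seconds where it used to take hours.

Here is the code. To improve performance further, I also changed the List to a BufferedReader.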

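// Imports the snippet needs 
import java.io.BufferedReader; 
import java.io.File; 
import java.io.FileInputStream; 
import java.io.IOException; 
import java.io.InputStream; 
import java.io.InputStreamReader; 
import java.util.ArrayList; 
import java.util.Collections; 
import java.util.Date; 
import java.util.List; 

/** 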
    * Removes all commented parts from the file content and returns a 
    * file structure that contains only the lines with declaration syntax, e.g. 
    * Create Package packageName <- Stores all declaration lines as separate 
    * strings in a list 
    * 
    * @param file 
    * @return file content 
    * @throws IOException 
    */ 
    private static List<String> filterContent(File file) throws IOException { 

     StringBuilder withoutComment = new StringBuilder(); 
//  String declare = ""; 
//  String[] content; 
//  List<String> readLines = FileUtils.readLines(file); 
// 
//  int size = readLines.size(); 
     System.out.println(file.getName() + " at " + new Date()); 
     List<String> declareLines = new ArrayList<String>(); 
     // String line = null; 
     int startComment = 0; 
     int endComment = 0; 
     Boolean check = false; 
     Boolean isLineDeclaration = false; 

     int j = 0; 
     int i = 0; 

     InputStream in = new FileInputStream(file); 
     BufferedReader reader = new BufferedReader(new InputStreamReader(in)); 
     String line; 
     // Reading content line by line 
     while ((line = reader.readLine()) != null) { 
      // for (int i = 0; i < size; i++) { 
      // line = readLines.get(i).toString();// storing current line data 
      // If line contains */ that means comment is ending in this line, 
      // making a note of the line number 
      if (line.toString().contains("*/")) { 
       endComment = i; 
       // Removing the content before */ from the line 
       int indexOf = line.indexOf("*/"); 
       line = line.replace(line.substring(0, indexOf + 2), ""); 
      } 

      // If startComment is assigned fresh value and end comment hasn't, 
      // that means the current line is part of the comment 
      // Ignoring the line in this case and moving on to the next one 
      if ((startComment > 0 && endComment == 0) || (endComment < startComment) || check) 
       continue; 

      // If line contains /* that means comment is starting in this line, 
      // making a note of the line number 
      if (line.contains("/*")) { 
       startComment = i; 
       // Removing the content after /* from the line 
       int indexOf = line.indexOf("/*"); 
       line = line.replace(line.substring(indexOf), ""); 
       if (i == 0) 
        check = true; // means comment in the very first line 
      } 

      // If line contains -- that means single line comment is present in 
      // this line, 
      // removing the content after -- 
      if (line.contains("--")) { 
       int indexOf = line.indexOf("--"); 
       line = line.replace(line.substring(indexOf), ""); 
      } 
      // If line contains # that means a single line comment is present in 
      // this line, 
      // removing the content after # 
      if (line.contains("#")) { 
       int indexOf = line.indexOf("#"); 
       line = line.replace(line.substring(indexOf), ""); 
      } 

      // At this point, all commented part is removed from the line, hence 
      // appending it to the final content 
      if (!line.isEmpty()) 
       withoutComment.append(line).append(" \n"); 
      // If line contains CREATE its a declaration line, holding it 
      // separately in the array 
      if (line.toUpperCase().contains(("CREATE"))) { 
       // Recording the declaration line; if the next line does not 
       // contain CREATE, it is appended to this entry in the 
       // else-branch below 
       declareLines.add(line.toUpperCase()); 

       isLineDeclaration = true; 
       j++; 
      } else if (isLineDeclaration && !line.toUpperCase().contains(("CREATE"))) { 
       // If next line does not contains Create and the current line is 
       // the not the last line, 
       // then considering two consecutive lines as declaration line, 
       declareLines.set(j - 1, declareLines.get(j - 1) + " " + line.toUpperCase()); 
       isLineDeclaration = false; 
      } 
      i++; 
     } 

     reader.close(); 
     System.out.println("Read lines " + new Date()); 
//  List<String> list = new ArrayList<String>(Arrays.asList(declareLines)); 
     declareLines.removeAll(Collections.singleton(null)); 

//  content = list.toArray(new String[list.size() + 1]); 

//  withoutComment = withoutComment..toUpperCase(); 
     declareLines.add(withoutComment.toString().toUpperCase()); 
     System.out.println("Returning uncommented content " + new Date()); 
     return declareLines; 
    }
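
To illustrate why the accumulation step matters so much, here is a minimal sketch that is not part of the original code. `AccumulationDemo`, `withConcat` and `withBuilder` are hypothetical names, and the line count and sample text are arbitrary. Each `result + line` builds a new String by copying everything collected so far, so the total work grows roughly with the square of the number of lines, while `StringBuilder.append` adds to a growable buffer. This is a plausible explanation for the speed-up described above, not a measurement of the original program.

```java
import java.util.Collections;
import java.util.List;

final class AccumulationDemo {

    /** Mirrors the original pattern: every iteration copies the whole result so far. */
    static String withConcat(List<String> lines) {
        String result = "";
        for (String line : lines) {
            result = result + line + " \n";
        }
        return result;
    }

    /** Same output, but appends into one growable buffer. */
    static String withBuilder(List<String> lines) {
        StringBuilder result = new StringBuilder();
        for (String line : lines) {
            result.append(line).append(" \n");
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Arbitrary sample data: 5,000 identical lines
        List<String> lines = Collections.nCopies(5_000, "SELECT * FROM some_table WHERE id = 1;");

        long t0 = System.nanoTime();
        String a = withConcat(lines);
        long t1 = System.nanoTime();
        String b = withBuilder(lines);
        long t2 = System.nanoTime();

        System.out.println("same output: " + a.equals(b));
        System.out.println("concat  ms: " + (t1 - t0) / 1_000_000);
        System.out.println("builder ms: " + (t2 - t1) / 1_000_000);
    }
}
```

The absolute timings depend on the JVM and the machine; the gap between the two methods widens as the number of lines grows.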

Source

2017-02-18 18:05:40