Search an array for a term and return the array entries that contain it

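I'm building a tool that analyzes a word and tries to determine when it was used most. I'm using Google's Ngram dataset for this. In my code I stream this data (about 2 gigabytes) and convert the streamed data into an array, treating each line of data as one entry. What I want to do is search the data for a specific word and store every array entry that contains that word in a variable. So far I can find out whether the word is in the dataset and print the word (or its position in the dataset) to the console. I'm still learning to program, so please bear with me if my code is messy.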

// imports fs (filesystem) package duh
const fs = require('fs');

// the data stream
const stream = fs.createReadStream("/Users/user/Desktop/authortest_nodejs/testdata/testdata - p");

// gonna use this to keep track of whether ive found the search term or not
let found = false;

// this is the term the program looks for in the data
var search = "proceeded";

// lovely beautiful unclean way of turning my search term into regular expression
var searchThing = `\\b${search}`
var searchRegExp = new RegExp(searchThing, "g");

// starts streaming the test data file
stream.on('data', function(data) {

    // if found is false (my search term isn't found in this data chunk), set the found variable to true or false depending on whether it found anything
    if (!found) found = !!('' + data).match(searchRegExp);

    // turns raw data to a string and tries to find the location of the search term within it
    var dataLoc = data.toString().search(searchRegExp);

    var dataStr = data.toString().match(searchRegExp);

    // if the data search is null, continue streaming (gotta do this cuz if .match() turns up with no results it throws an error smh)
    if (!dataStr) return;

    // removes the null spots and line breaks, pretty up the displayed stuff
    var dataDisplay = dataStr.toString().replace("null", " ");
    var dataLocDisplay = dataLoc.toString().replace(/(\r\n|\n|\r)/gm,"");

    // turns each line of raw data into array
    var dataArray = data.toString().split("\n");

    // log found instances of search term (dunno why the hell id wanna do that, should fix to something useful) edit: commented it out cuz its too annoying
    //console.log(dataDisplay);

    // log location of word in string (there, more useful now?)
    console.log(dataDisplay);
});

// what happens when the stream thing returns an error
stream.on('error', function(err) {
    console.log(err, found);
});

// what happens when the stream thing finishes streaming
stream.on('close', function(err) {
    console.log(err, found, searchRegExp);
});

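This currently prints every instance of the search term in the data (basically one word repeated a hundred times or so), but I need the entire line that contains the search term, not just the term itself ("proceeded 2006 5 3", not just "proceeded").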

Source

2017-09-20 BearbaBear

Please update your question with an example of the expected output and explain the results you are currently getting. – Soviut

@Soviut all good – BearbaBear

From what I understand, you're looking for something like this:

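const fs = require('fs');

// Resolves with every line of the file at `path` that contains `word` as a whole word.
function grep(path, word) {
    return new Promise((resolve, reject) => {
        const stream = fs.createReadStream(path, {encoding: 'utf8'});
        const search = new RegExp(`\\b${word}\\b`, 'i');
        const out = [];
        let buf = ''; // trailing partial line carried over from the previous chunk

        function checkLine(line) {
            if (search.test(line))
                out.push(line);
        }

        stream.on('data', (data) => {
            const lines = data.split('\n');
            lines[0] = buf + lines[0];
            buf = lines.pop(); // the last piece may be incomplete; keep it for the next chunk
            lines.forEach(checkLine);
        });

        stream.on('end', () => {
            checkLine(buf);
            resolve(out);
        });

        // reject instead of leaving the promise pending if the file cannot be read
        stream.on('error', reject);
    });
}

// works?
grep(__filename, 'stream').then(lines => console.log(lines));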

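I think this is pretty straightforward. The buf variable is needed to emulate line-by-line reading, because a chunk can end in the middle of a line (you could also use readline or a dedicated module for that).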
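For comparison, here is a rough sketch of the readline variant mentioned above. It is not the code from the answer: the name grepLines and the regex-escaping step are assumptions added for illustration, and error handling is left out. It relies on Node's built-in readline module, whose interface can be iterated with for await in reasonably recent Node versions.

```javascript
const fs = require('fs');
const readline = require('readline');

// Hypothetical helper (not from the original answer): collects every line
// of the file at `path` that contains `word` as a whole word.
async function grepLines(path, word) {
  // Assumption: the search word should be matched literally,
  // so regex metacharacters in it are escaped first.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const search = new RegExp(`\\b${escaped}\\b`, 'i');

  const rl = readline.createInterface({
    input: fs.createReadStream(path, { encoding: 'utf8' }),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });

  const out = [];
  // readline hands over complete lines, so no manual buffer is needed.
  for await (const line of rl) {
    if (search.test(line)) out.push(line);
  }
  return out;
}
```

It would be called the same way as grep above, for example grepLines(__filename, 'readline').then(lines => console.log(lines)). The trade-off is that readline does the chunk-boundary bookkeeping for you, at the cost of one more module.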

Source

2017-09-20 21:46:50 georg

Works well, but how do I remove the tabs? I tried .replace('\t', '') on the output array, but they're still there. – BearbaBear

@BearbaBear: replace(string, ...) only replaces the first occurrence. You need a regexp with the g flag: .replace(/\t/g, '') – georg
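To make the difference in that last comment concrete, here is a small sketch. The sample lines are made up to resemble the tab-separated rows discussed in the question, and a space is used as the replacement so the result stays readable. Note that replace is a string method, so it is applied to each entry with map rather than to the array itself.

```javascript
// Made-up sample lines in the tab-separated shape described in the thread.
const lines = ['proceeded\t2006\t5\t3', 'proceeded\t2007\t8\t4'];

// A string pattern replaces only the first tab in each line.
const firstOnly = lines.map(line => line.replace('\t', ' '));

// A regular expression with the g flag replaces every tab in each line.
const allTabs = lines.map(line => line.replace(/\t/g, ' '));

console.log(JSON.stringify(firstOnly[0])); // "proceeded 2006\t5\t3"
console.log(JSON.stringify(allTabs[0]));   // "proceeded 2006 5 3"
```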
