How to generate n-grams in Scala?

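I am trying to code a dissociated press algorithm based on n-grams in Scala. How do I generate n-grams for a large file, for example a file containing "the bee is the bee of the bees"?

First, it has to pick a random n-gram, for example "the bee".
Then it has to look for an n-gram that starts with the last (n-1) words, for example "bee of".
It prints the last word of that n-gram, and then repeats.

Could you give me some ideas on how to do this? Sorry for the inconvenience.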
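The three steps above can be sketched as follows. This is only a minimal illustration, not a definitive implementation: the names `DissociatedPress`, `ngrams`, `successors` and `generate`, the `maxWords` cut-off, and the choice to stop at a dead end (a prefix that no n-gram starts with) are all assumptions that do not come from the question.

```scala
import scala.util.Random

// Hypothetical helper object; all names here are illustrative assumptions.
object DissociatedPress {

  /** All n-grams of `words`, in order; empty if there are fewer than n words. */
  def ngrams(words: Vector[String], n: Int): Vector[Vector[String]] =
    if (words.size < n) Vector.empty else words.sliding(n).toVector

  /** Maps each (n-1)-word prefix to every word that follows it in the text. */
  def successors(words: Vector[String], n: Int): Map[Vector[String], Vector[String]] =
    ngrams(words, n)
      .groupBy(_.init)
      .map { case (prefix, grams) => prefix -> grams.map(_.last) }

  /** Emits at most `maxWords` words, stopping early at a dead end. */
  def generate(words: Vector[String], n: Int, maxWords: Int, rnd: Random): Vector[String] = {
    require(n >= 2, "n must be at least 2")
    val grams = ngrams(words, n)
    if (grams.isEmpty) Vector.empty
    else {
      val next = successors(words, n)
      // Step 1: start from a randomly chosen n-gram.
      var out = grams(rnd.nextInt(grams.size))
      var stuck = false
      while (!stuck && out.size < maxWords) {
        // Step 2: look up the n-grams that start with the last n-1 words emitted.
        next.get(out.takeRight(n - 1)) match {
          // Step 3: append the last word of one of them, chosen at random.
          case Some(candidates) => out = out :+ candidates(rnd.nextInt(candidates.size))
          case None             => stuck = true
        }
      }
      out.take(maxWords)
    }
  }
}
```

With the example sentence, `DissociatedPress.generate("the bee is the bee of the bees".split(" ").toVector, 2, 20, new Random())` returns a random word sequence in which every adjacent pair of words also occurs in the input. Building the prefix map once avoids rescanning all n-grams at every step.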

Source

2011-11-24 user1002579

I don't know what an n-gram is. Are you picking words at random, or is there some logic to it? – santiagobasulto

@santiagobasulto Wikipedia is your friend: http://en.wikipedia.org/wiki/N-gram

Is this by any chance related to http://stackoverflow.com/questions/8256830/how-to-make-string-sequence-in-scala?

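Your question could be a bit more specific, but here is my attempt.

val words = "the bee is the bee of the bees"
// sliding(2) yields every window of two consecutive words (the 2-grams)
words.split(' ').sliding(2).foreach(p => println(p.mkString(" ")))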

Source

2011-11-24 15:08:46 peri4n

This only gives you 2-grams. If you need n-grams, you have to parameterize n. – tuxdna

Here you go, with n as a parameter:

val words = "the bee is the bee of the bees"
val w = words.split(" ")

val n = 4
// collect every k-gram for k = 1 to n, shortest first
val ngrams = (1 to n).flatMap(i => w.sliding(i).map(_.toList))
ngrams foreach println

List(the) 
List(bee) 
List(is) 
List(the) 
List(bee) 
List(of) 
List(the) 
List(bees) 
List(the, bee) 
List(bee, is) 
List(is, the) 
List(the, bee) 
List(bee, of) 
List(of, the) 
List(the, bees) 
List(the, bee, is) 
List(bee, is, the) 
List(is, the, bee) 
List(the, bee, of) 
List(bee, of, the) 
List(of, the, bees) 
List(the, bee, is, the) 
List(bee, is, the, bee) 
List(is, the, bee, of) 
List(the, bee, of, the) 
List(bee, of, the, bees)

Source

2013-05-24 09:58:58 tuxdna

Try this; it is a stream-based approach. Because the n-grams are produced lazily, it does not need much memory while computing them.

object ngramstream extends App {

  // Applies f to every element of the stream, for its side effects only.
  def process(st: Stream[Array[String]])(f: Array[String] => Unit): Stream[Array[String]] = st match {
    case x #:: xs =>
      f(x)
      process(xs)(f)
    case _ => Stream[Array[String]]()
  }

  // Lazily concatenates the 2-grams, 3-grams, ..., n-grams of `words`.
  def ngrams(n: Int, words: Array[String]) = {
    // exclude 1-grams
    (2 to n).map { i => words.sliding(i).toStream }
      .foldLeft(Stream[Array[String]]()) {
        (a, b) => a #::: b
      }
  }

  val words = "the bee is the bee of the bees"
  val n = 4
  val ngrams2 = ngrams(n, words.split(" "))

  process(ngrams2) { x =>
    println(x.toList)
  }

}

Output:

List(the, bee) 
List(bee, is) 
List(is, the) 
List(the, bee) 
List(bee, of) 
List(of, the) 
List(the, bees) 
List(the, bee, is) 
List(bee, is, the) 
List(is, the, bee) 
List(the, bee, of) 
List(bee, of, the) 
List(of, the, bees) 
List(the, bee, is, the) 
List(bee, is, the, bee) 
List(is, the, bee, of) 
List(the, bee, of, the) 
List(bee, of, the, bees)

Source

2013-12-17 12:48:58 tuxdna

I like it, though I'm not sure how useful 'process' is. Why not just do 'ngrams(...).foreach(x => println(x.toList))'? – Mortimer

@Mortimer: Interesting question. 'process' is just an extra function. We can definitely use 'ngrams2 foreach { x => println(x.toList) }'. Thanks :-) – tuxdna
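On the memory point for large files: a `Stream` memoizes the elements it has already computed for as long as its head is referenced, so a plain `Iterator` may be a lighter option when the n-grams are only traversed once. The sketch below is an assumption-laden illustration rather than part of any answer above; `LazyNgrams`, `wordsOf` and `ngrams` are hypothetical names, and whitespace splitting is assumed as the tokenization.

```scala
// Hypothetical helper object; names and tokenization are illustrative assumptions.
object LazyNgrams {

  /** Lazily splits lines into non-empty, whitespace-separated words. */
  def wordsOf(lines: Iterator[String]): Iterator[String] =
    lines.flatMap(_.split("\\s+").iterator).filter(_.nonEmpty)

  /** Lazily yields the n-grams of `words`; a trailing window shorter than n is dropped. */
  def ngrams(words: Iterator[String], n: Int): Iterator[List[String]] =
    words.sliding(n).withPartial(false).map(_.toList)
}
```

For a file, the lines could come from `scala.io.Source.fromFile("corpus.txt").getLines()` (the path is a placeholder), for example `LazyNgrams.ngrams(LazyNgrams.wordsOf(lines), 3).foreach(println)`. Because the words are flattened across lines, n-grams can span line boundaries.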
