Processing an XML string inside a Spark UDF and returning struct fields

I have a DataFrame column named Body (String). The data in the Body column looks like this:

<p>I want to use a track-bar to change a form's opacity.</p> 

<p>This is my code:</p> 

<pre><code>decimal trans = trackBar1.Value/5000; 
this.Opacity = trans; 
</code></pre> 

<p>When I build the application, it gives the following error:</p> 

<blockquote> 
    <p>Cannot implicitly convert type 'decimal' to 'double'.</p> 
</blockquote> 

<p>I tried using <code>trans</code> and <code>double</code> but then the 
control doesn't work. This code worked fine in a past VB.NET project. </p> 
,While applying opacity to a form should we use a decimal or double value?

Using Body, I want to produce two separate values, code and text. The code is whatever sits between elements named code, and the text is everything else.

I have created a UDF that looks like this, but it is not working:

case class bodyresults(text:String,code:String) 
val Body:String=>bodyresults=(body:String)=>{
  val xmlbody=scala.xml.XML.loadString(body)
  val code = (xmlbody \\ "code").toString
  val text = "I want every thing else as text. what should I do"
  (text,code)
}
val bodyudf=udf(Body) 
val posts5=posts4.withColumn("codetext",bodyudf(col("Body")))

My questions are: 1. As you can see, the data has no root node. Can I still use Scala XML parsing? 2. How do I parse everything other than the code into text?

Please let me know if there is anything wrong in my code. Expected output:

(code,text) 
code = decimal trans = trackBar1.Value/5000;this.Opacity = trans;trans double 
text = everything else

Source

2017-09-20 Makkena

Are there any errors? What is the expected output? – philantrovert

In spark-shell, no error message is displayed. Something is wrong with the UDF body; spark-shell is not creating the function. – Makkena

Are you sure? The code tags appear in multiple places. Do you want all of them, or only the one inside 'pre', i.e. 'decimal trans = ...'? – philantrovert

As an alternative to the string replacement in the code below, you can also use a RewriteRule and override its transform method to empty the <pre> tags in your XML.

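import org.apache.spark.sql.functions.udf
import scala.xml.XML

case class bodyresults(text: String, code: String)

val bodyudf = udf { (body: String) =>

    // Wrap the fragment in an explicit <body> root element before parsing
    val xmlElems = XML.loadString(s""" <body> ${body} </body> """)
    // Extract the code inside <pre><code> blocks
    val code = (xmlElems \\ "body" \\ "pre" \\ "code").text

    // Everything else is text. Use replace (not replaceAll) so the code is
    // matched literally instead of being interpreted as a regex
    val text = (xmlElems \\ "body").text.replace(code, "")

    bodyresults(text, code)
}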

This UDF returns a StructType, like this:

org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function1>,StructType(StructField(text,StringType,true), StructField(code,StringType,true)),List(StringType))

You can now call it on your DataFrame to get posts5, like this:

val posts5 = df.withColumn("codetext", bodyudf($"xml"))
posts5: org.apache.spark.sql.DataFrame = [xml: string, codetext: struct<text:string,code:string>]

To extract a specific column:

posts5.select($"codetext.code").show 
+--------------------+ 
|                code|
+--------------------+ 
|decimal trans = t...| 
+--------------------+
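
The RewriteRule alternative mentioned above could look roughly like the following. This is a minimal sketch rather than the answerer's own code: `DropPre` and `splitBody` are hypothetical names, and it assumes the scala-xml module is on the classpath, which the UDF above already requires. Note that inline <code> elements outside <pre> are left in the text.

```scala
import scala.xml.{Elem, Node, NodeSeq, XML}
import scala.xml.transform.{RewriteRule, RuleTransformer}

// Hypothetical rule: drops every <pre> element so only the prose remains.
object DropPre extends RewriteRule {
  override def transform(n: Node): Seq[Node] = n match {
    case e: Elem if e.label == "pre" => NodeSeq.Empty
    case other                       => other
  }
}

// Hypothetical helper: returns (text, code) for one Body value.
def splitBody(body: String): (String, String) = {
  // Wrap the fragment in a root element so it parses as a single document.
  val root = XML.loadString(s"<body>$body</body>")
  // Code is whatever sits inside <pre><code>.
  val code = (root \\ "pre" \\ "code").text
  // Text is the document with all <pre> elements removed.
  val text = new RuleTransformer(DropPre).transform(root).map(_.text).mkString
  (text, code)
}

val (text, code) = splitBody(
  "<p>This is my code:</p><pre><code>this.Opacity = trans;</code></pre><p>It fails.</p>")
```

Inside the UDF, the body would then end with `bodyresults(text, code)` built from these two values, so nothing has to be removed by string replacement.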

Source

2017-09-20 07:17:26 philantrovert

Thank you. I understand it now. I don't have enough reputation to upvote your answer. – Makkena

When I try to implement this, I get the error SAXParseException: The entity "nbsp" was referenced, but not declared. I added <?xml version="1.0" encoding="utf-8"?> to the string, but it doesn't work. Do you know anything about that? – Makkena

Try putting '<!ENTITY nbsp " ">' after '<?xml version ..>' and see whether it works. – philantrovert
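
If declaring the entity is not convenient, a different workaround (not the one suggested in the comment above) is to rewrite `&nbsp;` as the numeric character reference `&#160;` before parsing, since numeric references need no declaration. A minimal sketch, where `parseBody` is a hypothetical helper name and only the `nbsp` entity is handled:

```scala
import scala.xml.{Elem, XML}

// Hypothetical helper: swaps the undeclared HTML entity for its numeric
// character reference, then wraps the fragment in a root element and parses it.
def parseBody(body: String): Elem = {
  val cleaned = body.replace("&nbsp;", "&#160;")
  XML.loadString(s"<body>$cleaned</body>")
}

val parsed = parseBody("<p>a&nbsp;b</p>")
```

Other HTML-only entities in the data would need the same treatment, so this is best seen as a stopgap for the specific error reported here.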
