Does the Google Cloud Natural Language API actually support parsing HTML?

I'm trying to extract body content from news sites and blogs. Does the Google Cloud Natural Language API actually support parsing HTML?

According to the documentation, documents.analyzeSyntax should work as expected with HTML if you pass the page's raw HTML (UTF-8) as the document's content and set the document's type to HTML. The documentation lists HTML as a supported content type.
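For concreteness, here is a minimal sketch of the JSON body such a documents.analyzeSyntax request would carry. The helper name build_analyze_syntax_body is made up for illustration, and the sketch only builds the body; it does not send anything or handle authentication.

```python
import json

# REST endpoint for syntax analysis in v1 of the API.
ANALYZE_SYNTAX_URL = "https://language.googleapis.com/v1/documents:analyzeSyntax"


def build_analyze_syntax_body(content, doc_type="HTML"):
    """Build the JSON body for a documents.analyzeSyntax request.

    build_analyze_syntax_body is a hypothetical helper, not part of any library.
    """
    if doc_type not in ("PLAIN_TEXT", "HTML"):
        raise ValueError("doc_type must be 'PLAIN_TEXT' or 'HTML'")
    return {
        # The document's type tells the API how to interpret the content.
        "document": {"type": doc_type, "content": content},
        # encodingType controls how token offsets are calculated.
        "encodingType": "UTF8",
    }


body = build_analyze_syntax_body("<p>hello</p>")
print(json.dumps(body, indent=2))
```

If the type field is left as PLAIN_TEXT while the content is HTML, the markup is sent as ordinary text, which is one way to end up with tags in the tokens.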

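In practice, however, the resulting sentences and tokens are cluttered with HTML tags, as if the parser thinks the input is plain text. That rules out the GC NL API for my use case, which is presumably a fairly common task for many people who apply natural language processing to web pages.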

For reference, here is an example from the Dandelion API of the kind of output I would expect for a given HTML input (or, in this case, a URL to an HTML page as input).

So my question is: am I calling the API incorrectly, or does the NL API not support HTML?

2017-06-12 fisch2

Yes, it does.

I'm not sure which language you were using, but below is a Python example using the client library. It works for me, in that the HTML tags <p> and </p> are not processed as natural language:

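# Note: language.Client is the legacy interface of the google-cloud-language
# package (2017-era releases); newer releases use a different client class.
from google.cloud import language

client = language.Client()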

# document of type PLAIN_TEXT 
text = "hello" 
document_text = client.document_from_text(text) 
syntax_text = document_text.analyze_syntax() 

print("\n\ndocument of type PLAIN_TEXT:")
for token in syntax_text.tokens: 
    print(token.__dict__) 

# document of type HTML 
html = "<p>hello</p>" 
document_html = client.document_from_html(html) 
syntax_html = document_html.analyze_syntax() 

print("\n\ndocument of type HTML:") 
for token in syntax_html.tokens: 
    print(token.__dict__) 

# document of type PLAIN_TEXT but should be HTML 
document_mismatch = client.document_from_text(html) 
syntax_mismatch = document_mismatch.analyze_syntax() 

print("\n\ndocument of type PLAIN_TEXT but with HTML content:") 
for token in syntax_mismatch.tokens: 
    print(token.__dict__)
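
The example above uses the client interface that was current in 2017. As a rough sketch only, assuming a recent google-cloud-language release (2.x) is installed and credentials are configured, the same HTML request might look like this. The helper name analyze_html_syntax is made up for illustration.

```python
def analyze_html_syntax(html):
    """Return the token strings the API reports for an HTML document.

    Hypothetical helper; assumes google-cloud-language 2.x is installed and
    application default credentials are available.
    """
    # Imported inside the function so this module still loads where the
    # package is not installed.
    from google.cloud import language_v1

    client = language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content=html,
        type_=language_v1.Document.Type.HTML,  # HTML rather than PLAIN_TEXT
    )
    response = client.analyze_syntax(
        request={
            "document": document,
            "encoding_type": language_v1.EncodingType.UTF8,
        }
    )
    # Each token carries its surface text in token.text.content.
    return [token.text.content for token in response.tokens]
```

Calling analyze_html_syntax("<p>hello</p>") makes a network request, so it needs a project with the API enabled.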

If you go through the setup steps on this page, you can also try it quickly with the gcloud command-line tool:

gcloud beta ml language analyze-syntax --content="<p>hello</p>" --content-type="HTML"

2017-06-16 19:07:24 dizcology
