Python - HTML comment is not JSON serializable

I am trying to parse an HTML page and save it to a database. I build JSON from the tags of the page.

Some of the tags contain JavaScript, like this:

<script type="text/javascript">RegisterSod("search.js", "");</script><script type="text/javascript" language="JavaScript" defer="defer"> 
<!-- 
function SearchEnsureSOD() { EnsureScript('search.js',typeof(GoSearch)); } _spBodyOnLoadFunctionNames.push('SearchEnsureSOD');function SB420AF5B_Submit() 
. 
. 
. 
{ document.getElementById('ctl00_region_header_region_headerLinks_helpAreaID_ctl00_ctl00_SB420AF5B_InputKeywords').value=''; }} 
// --> 
</script>

This is a normal tag item, and I have no problem with it:

{'tag': 'div', 'unqid': '.....', 'id': 'newsContent0'}

But for the JavaScript tag I get an item like this, which causes the error:

{'text': 'IK F uu ph---------------------', 'tag': <cyfunction Comment at 0x00000000027A79A0>, 'unqid': '.....'}

This is my code:

import json
import requests
from lxml import html

ac = requests.get(url)
html_text = ac.text
lx = html.fromstring(html_text)
# ... some parsing code ...

json.dumps(items).decode('utf-8')  # <-- this is where I get the error

The error is below:

Traceback (most recent call last): 
    File "main3.py", line 132, in <module> 
    PageRunner(url) 
    File "main3.py", line 122, in PageRunner 
    InsertPageTags(1, url) 
    File "main3.py", line 58, in InsertPageTags 
    parameter = (WebsiteID, Url, json.dumps(items).decode('utf-8')) 
    File "C:\Python27\lib\json\__init__.py", line 244, in dumps 
    return _default_encoder.encode(obj) 
    File "C:\Python27\lib\json\encoder.py", line 207, in encode 
    chunks = self.iterencode(o, _one_shot=True) 
    File "C:\Python27\lib\json\encoder.py", line 270, in iterencode 
    return _iterencode(o, 0) 
    File "C:\Python27\lib\json\encoder.py", line 184, in default 
    raise TypeError(repr(o) + " is not JSON serializable") 
TypeError: <cyfunction Comment at 0x00000000029279A0> is not JSON serializable

How can I dump HTML that contains comments, or remove the comments from the HTML?

Source

2016-12-15 Halislus

Do you want to remove the comments? Strip out the <!-- tags.

@Halislus - What is the URL? Also, could you post some of the parsing code, so that a working example for an answer can be tested? – David542

Instead of using JavaScript as in the previous answer, you can use a function based on regular expressions in Python.

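import re

def js_comment_clean(js):
    # Remove complete comments, including multi-line ones: <!-- ... -->
    js = re.sub(r"<!--[\s\S]*?-->", "", js)
    # Remove comments with no body, e.g. <!--->
    js = re.sub(r"<!---+>?", "", js)
    # Remove pseudo-comments such as <!foo>, but keep DOCTYPE and CDATA
    js = re.sub(r"<!(?![dD][oO][cC][tT][yY][pP][eE]|\[CDATA\[)[^>]*>?", "", js)
    # Remove <? ... > pseudo-comments
    js = re.sub(r"<[?][^>]*>?", "", js)
    return js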

So, change your original line:

html_text = ac.text

to:

html_text = js_comment_clean(ac.text)
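As a quick sanity check of the comment-stripping idea, here is a minimal sketch using only the first pattern. The function name strip_html_comments and the sample strings are made up for illustration and are not from the question. Note that for a script block whose body is wrapped in <!-- ... -->, as in the question, the JavaScript inside the comment is removed as well, which may or may not be what you want.

```python
import re

# Matches one complete HTML comment, including multi-line ones.
COMMENT_RE = re.compile(r"<!--[\s\S]*?-->")

def strip_html_comments(text):
    """Hypothetical helper: remove <!-- ... --> blocks from HTML text."""
    return COMMENT_RE.sub("", text)

plain = '<div id="a">x</div><!-- note\nmore --><p>y</p>'
script = '<script>\n<!--\nfunction f() {}\n// -->\n</script>'

cleaned_plain = strip_html_comments(plain)
cleaned_script = strip_html_comments(script)

print(cleaned_plain)
print(repr(cleaned_script))
```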

Source

2016-12-17 20:47:35

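Basically, the Python json encoder does not know what to do with <cyfunction ...>, which is why you get the error. Since the error is raised by json.dumps, you would need to write a customized JSON encoder, i.e. a json.JSONEncoder subclass (json module documentation: https://docs.python.org/2/library/json.html#json.JSONDecoder).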
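A minimal sketch of such a custom encoder might look like the following. The class name TagEncoder, the stand-in Comment function, and the sample items are assumptions for illustration; in the real code the non-serializable value would be lxml's Comment factory, which lxml uses as the .tag of comment nodes. The sketch assumes that replacing a callable with its name is an acceptable representation for your database.

```python
import json

class TagEncoder(json.JSONEncoder):
    """Hypothetical encoder: stores callables (such as a Comment factory) by name."""

    def default(self, o):
        if callable(o):
            # Fall back to repr() if the callable has no __name__
            return getattr(o, "__name__", repr(o))
        # Anything else still raises the usual TypeError
        return json.JSONEncoder.default(self, o)

def Comment():
    """Stand-in for lxml.etree.Comment, so the sketch runs without lxml."""

items = [
    {"tag": "div", "id": "newsContent0"},
    {"tag": Comment, "text": "..."},
]

encoded = json.dumps(items, cls=TagEncoder, sort_keys=True)
print(encoded)
```

With this approach the comment entries stay in the output, labelled by name, instead of being removed from the page.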

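If you know that the tags are all of the form <some_text>, you could first replace them with an empty string (or something similar) using a regex. Taking the regex from this answer (Remove HTML comments with Regex, in Javascript), it would look like this: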

var COMMENT_PSEUDO_COMMENT_OR_LT_BANG = new RegExp(
'<!--[\\s\\S]*?(?:-->)?' 
+ '<!---+>?' // A comment with no body 
+ '|<!(?![dD][oO][cC][tT][yY][pP][eE]|\\[CDATA\\[)[^>]*>?' 
+ '|<[?][^>]*>?', // A pseudo-comment 
'g');
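If the comment entries are not needed in the database at all, another option is to leave the HTML untouched and drop the non-serializable items just before calling json.dumps. The sketch below assumes each item is a dict with a 'tag' key, as in the question; the helper name serializable_items and the stand-in Comment function are hypothetical.

```python
import json

try:
    string_types = (basestring,)  # Python 2
except NameError:
    string_types = (str,)  # Python 3

def serializable_items(items):
    """Hypothetical helper: keep only items whose 'tag' is a plain string."""
    return [item for item in items if isinstance(item.get("tag"), string_types)]

def Comment():
    """Stand-in for lxml.etree.Comment, so the sketch runs without lxml."""

items = [
    {"tag": "div", "id": "newsContent0"},
    {"tag": Comment, "text": "..."},
]

kept = serializable_items(items)
print(json.dumps(kept, sort_keys=True))
```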

Source

2016-12-15 22:31:29 David542
