xml属性/タグの無効なユニコード文字

xml属性（タグ）の無効なユニコード文字のリストは何ですか？xml属性/タグの無効なユニコード文字

次のpython3コードが示すように：

import xml.etree.ElementTree as ET 
from io import StringIO as sio 

xml_dec = '<?xml version="1.1" encoding="UTF-8"?>' 
unicode_text = '<root>textº</root>' 
valid_unicode = '<标签 属性="值">文字</标签>' 
invalid_unicode_attribute = '<tag attributeº="value">text</tag>' 
invalid_unicode_tag = '<tagº>text</tagº>' 

ET.parse(sio(xml_dec + unicode_text)) 
# works 

ET.parse(sio(xml_dec + valid_unicode)) 
# works 

ET.parse(sio(xml_dec + invalid_unicode_attribute)) 
# ParseError 

ET.parse(sio(xml_dec + invalid_unicode_tag)) 
# ParseError

Unicode文字º、すなわちU+00BAそれは要素テキストではなく、要素の属性またはタグである場合、構文解析することができます。一方、中国語などの他のユニコード文字は、要素属性とタグで解析できます。

私はhttps://validator.w3.org/checkでXML <?xml version="1.1" encoding="UTF-8"?><tagº>text</tagº>をチェックし、それがエラーを与える：

Line 1, Column 43: character "º" not allowed in attribute specification list

しかし、XML Recommendation 1.1, §2.2 Charactersで、それはそれは許可されていると言う：

Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

私の質問はどこ私ができる、ですXML属性/タグで無効なUnicode文字のリストを探しますか？タグに使用できる文字については

出典

2017-04-06 azalea

これは今の属性やタグ名についてですか？タイトルと最後の文章では属性について説明していますが、例はテキストとタグのみです。 – lenz

いずれにしても、自分がリンクしている文書の一部をスクロールするだけで済みます。たとえば、[ここ]（https://www.w3.org/TR/xml11/#NT-NameStartChar）は、タグ名に使用できる文字の定義です。 – lenz

用語を理解していれば、そのような質問への回答が得やすくなります。タグの例です： ''。それは2つの名前（要素名と属性名）と、属性値、空白、等号、アポストロフィなどを含むさまざまなものを含んでいます。あなたの質問は、どの文字がタグに許されているのではなく、要素名と属性名に使用できます。 –

および属性名、W3C recommendationは（これに自分でリンク - あなたがテキストノードで使用することができるものの定義を見ていましたが）以下の状態：

Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters.

そして

Document authors are encouraged to use names which are meaningful words or combinations of words in natural languages, and to avoid symbolic or white space characters in names. Note that COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and MIDDLE DOT are explicitly permitted.

The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are more useful as delimiters in contexts where XML names are used outside XML documents; providing this group gives those contexts hard guarantees about what cannot be part of an XML name.

これは、Unicodeの多くは、範囲一覧表示されますformal definitionが続いている：

NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | 
        [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | 
        [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | 
        [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | 
        [#x10000-#xEFFFF] 
NameChar  ::= NameStartChar | "-" | "." | [0-9] | #xB7 | 
        [#x0300-#x036F] | [#x203F-#x2040] 
Name   ::= NameStartChar (NameChar)*

男性序列番号º（#xBA）は何らかの理由で（少なくともいくつかの言語は一般的な単語の略語でそれを使用しているため、私の "区切り文字"のようには見えません）。

タグ名に数字、ハイフン、ピリオドを使用できますが、最初の文字として使用することはできません。

出典

2017-04-06 19:49:58 lenz

答えて

関連する問題