How can I replace encoded characters with string literals? Something like \uDXYZW?

I have the following Java code to replace accented characters (tildes) like these:

á é í ó ú Á É Í Ó Ú à è ì ò ù À È Ì Ò Ù 

text = text.replace("Ã¡", "a"); 
    text = text.replace("Ã©", "e"); 
    text = text.replace("Ã", "i"); 
    text = text.replace("Ã³", "o"); 
    text = text.replace("Ãº", "u"); 

    // odd characters: uppercase acute accents
    text = text.replace("Ã", "A"); 
    text = text.replace("Ã‰", "E"); 
    text = text.replace("Ã", "I"); 
    text = text.replace("Ã“", "O"); 
    text = text.replace("Ãš", "U"); 


    // odd characters: lowercase grave accents
    text = text.replace("Ã ", "a"); 
    text = text.replace("Ã¨", "e"); 
    text = text.replace("Ã¬", "i"); 
    text = text.replace("Ã²", "o"); 
    text = text.replace("Ã¹", "u"); 

    // odd characters: uppercase grave accents
    text = text.replace("Ã€", "A"); 
    text = text.replace("Ãˆ", "E"); 
    text = text.replace("ÃŒ", "I"); 
    text = text.replace("Ã’", "O"); 
    text = text.replace("Ã™", "U"); 

    // odd characters: lowercase and uppercase ñ
    text = text.replace("Ã‘", "n"); 
    text = text.replace("Ã±", "N");

I would like to use a notation like this instead:

text = text.replace("\uD1232", "N");

But I don't know where to find a table for Ã€, Ãˆ, ÃŒ ...

Source

2017-05-16 Jacob Jimenez

You shouldn't do this by hand; use [`Normalizer`](http://docs.oracle.com/javase/7/docs/api/java/text/Normalizer.html). That's what it was designed for.

[Easy way to remove UTF-8 accents from a string?](http://stackoverflow.com/questions/15190656/easy-way-to-remove-utf-8-accents-from-a-string)
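For what it's worth, the Normalizer suggestion can be sketched roughly as follows. This assumes the text has already been decoded correctly (it does not repair garbled input such as Ã¡), and the names StripAccents and stripAccents are made up for illustration.

```java
import java.text.Normalizer;

public class StripAccents {

    // Hypothetical helper: decomposes each accented letter into a base
    // letter plus combining marks (NFD), then deletes the marks.
    static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // The escapes below spell the accented vowels a e i o u, then
        // an uppercase and a lowercase n with tilde.
        String accented = "\u00e1\u00e9\u00ed\u00f3\u00fa \u00d1 \u00f1";
        System.out.println(stripAccents(accented)); // prints "aeiou N n"
    }
}
```

Writing the input with escapes keeps the source file plain ASCII, so the result does not depend on the compiler's source encoding.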

The JDK includes a tool named native2ascii.

Create a text file in UTF-8 encoding that contains the special characters, for example in.txt:

á é í ó ú Á É Í Ó Ú à è ì ò ù À È Ì Ò Ù

Then call:

native2ascii -encoding UTF-8 in.txt out.txt

Afterwards, the file out.txt contains escape sequences like these:

\u00e1 \u00e9 \u00ed \u00f3 \u00fa \u00c1 \u00c9 \u00cd \u00d3 \u00da \u00e0 \u00e8 \u00ec \u00f2 \u00f9 \u00c0 \u00c8 \u00cc \u00d2 \u00d9
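As far as I know, native2ascii is no longer shipped with JDK 9 and later, so here is a minimal sketch that produces the same kind of escape sequences from within Java. UnicodeEscaper and toUnicodeEscapes are hypothetical names, and the sketch works on an in-memory string rather than on in.txt.

```java
public class UnicodeEscaper {

    // Hypothetical helper: leaves ASCII characters alone and rewrites every
    // other UTF-16 code unit as a backslash, the letter u and four hex digits.
    static String toUnicodeEscapes(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Input: a with acute, e with acute, uppercase n with tilde.
        String input = "\u00e1 \u00e9 \u00d1";
        // Prints the three escape sequences for those characters,
        // separated by spaces.
        System.out.println(toUnicodeEscapes(input));
    }
}
```

The same helper can be fed the garbled pairs from the question to find out which escapes they correspond to.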

Source

2017-05-16 22:13:50 vanje

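Part of the text appears to be UTF-8 encoded text that was originally misinterpreted as ISO-8859-1 (Latin-1) or a similar encoding. As the output further down shows, n/N were evidently mixed up.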

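import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// The class name is arbitrary; it only makes the snippet compilable.
public class RepairDemo {

    public static void main(String[] args) {
        // odd characters: lowercase grave accents
        p1("Ã ", "a"); // plain space: corrupted entry
        p1("Ã\u00a0", "a"); // Non-breaking space instead
        p1("Ã¨", "e");
        p1("Ã¬", "i");
        p1("Ã²", "o");
        p1("Ã¹", "u");

        // odd characters: uppercase grave accents
        p1("Ã€", "A");
        p1("Ãˆ", "E");
        p1("ÃŒ", "I");
        p1("Ã’", "O");
        p1("Ã™", "U");

        // odd characters: lowercase and uppercase ñ
        p1("Ã‘", "n");
        p1("Ã±", "N");
    }

    // First attempt: assume the UTF-8 bytes were misread as ISO-8859-1.
    static void p1(String s, String t) {
        String v = new String(s.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
        // Decompose (NFD), then strip the combining marks to drop the accent.
        String u = Normalizer.normalize(v, Normalizer.Form.NFD)
                .replaceAll("\\pM", "");
        if (u.equalsIgnoreCase(t)) {
            System.out.printf("[1] %s -> %s :: %s%n", v, u, t);
        } else {
            p2(s, t); // fall back to Windows-1252
        }
    }

    // Second attempt: assume the UTF-8 bytes were misread as Windows-1252.
    static void p2(String s, String t) {
        String v = new String(s.getBytes(Charset.forName("Windows-1252")),
                StandardCharsets.UTF_8);
        String u = Normalizer.normalize(v, Normalizer.Form.NFD)
                .replaceAll("\\pM", "");
        System.out.printf("[2] %s -> %s :: %s%n", v, u, t);
    }
}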

[2] � -> � :: a
[1] à -> a :: a
[1] è -> e :: e
[1] ì -> i :: i
[1] ò -> o :: o
[1] ù -> u :: u
[2] À -> A :: A
[2] È -> E :: E
[2] Ì -> I :: I
[2] Ò -> O :: O
[2] Ù -> U :: U
[2] Ñ -> N :: n
[1] ñ -> n :: N

The code above is a successful attempt at repairing the text. The first entry, the one with the plain space, is evidently corrupted; s = s.replace(' ', '\u00a0'); would fix it.

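The code above uses the Normalizer to discard the accents: it decomposes each accented letter into a base letter followed by combining diacritical marks, and replaceAll then removes the marks.

UTF-8 is an encoding of the Unicode character set. ISO-8859-1 is Latin-1, a subset of Windows-1252. Windows-1252 is Windows Latin-1, a "superset" of Latin-1: it assigns printable characters such as €, ˆ and Œ to the bytes 0x80 to 0x9F, where ISO-8859-1 has control codes. That is why some entries are only repaired by the Windows-1252 attempt.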
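To make the relationship between these charsets concrete, here is a small sketch of how the garbling arises in the first place. MojibakeDemo, misread and hex are hypothetical names; the code encodes one character as UTF-8 and then decodes the bytes with the wrong charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Encodes the text as UTF-8, then decodes those bytes with the wrong charset.
    static String misread(String text, Charset wrong) {
        return new String(text.getBytes(StandardCharsets.UTF_8), wrong);
    }

    // Renders each UTF-16 code unit as U+XXXX so the console encoding does not matter.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String upperA = "\u00c0"; // A with grave accent; its UTF-8 bytes are C3 80

        // Windows-1252 maps byte 0x80 to the euro sign.
        System.out.println(hex(misread(upperA, cp1252))); // prints U+00C3 U+20AC

        // ISO-8859-1 maps byte 0x80 to a control code.
        System.out.println(hex(misread(upperA, StandardCharsets.ISO_8859_1))); // prints U+00C3 U+0080
    }
}
```

Both results start with Ã (U+00C3); only the second character differs, which is why entries containing € or similar symbols need the Windows-1252 route.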

(For the fewest surprises, the code above should be edited and compiled as Java source in UTF-8 encoding.)
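If the source encoding is a worry, one option is to write the garbled literals with Unicode escapes so the file stays plain ASCII. Below is a minimal sketch along those lines. MojibakeRepair, reinterpret and repair are hypothetical names, and falling back to Windows-1252 whenever the first attempt yields U+FFFD is an assumption of this sketch, not something taken from the code above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {

    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Re-encodes the garbled text with the charset it was misread as,
    // then decodes the resulting bytes as UTF-8.
    static String reinterpret(String garbled, Charset misreadAs) {
        return new String(garbled.getBytes(misreadAs), StandardCharsets.UTF_8);
    }

    // Assumption of this sketch: try ISO-8859-1 first and fall back to
    // Windows-1252 if decoding produced the replacement character U+FFFD.
    static String repair(String garbled) {
        String candidate = reinterpret(garbled, StandardCharsets.ISO_8859_1);
        if (candidate.indexOf('\uFFFD') < 0) {
            return candidate;
        }
        return reinterpret(garbled, CP1252);
    }

    public static void main(String[] args) {
        // A-tilde followed by plus-minus sign: repaired via ISO-8859-1
        // into a lowercase n with tilde.
        System.out.println(repair("\u00c3\u00b1").equals("\u00f1")); // prints true

        // A-tilde followed by the euro sign: needs the Windows-1252
        // fallback and yields an uppercase A with grave accent.
        System.out.println(repair("\u00c3\u20ac").equals("\u00c0")); // prints true
    }
}
```

The repaired string can then be passed through the Normalizer step if the accents should be dropped as well.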

Source

2017-05-16 23:00:47
