Substr based on byte count rather than character count

I'm building an input system where the maximum field size is only 200 bytes. I'm counting the number of bytes remaining with the following (this approach may itself be up for debate!):

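var totalBytes = 200;
var $newVal = $(this).val();
// Count the percent-escaped bytes whose high nibble is 8..B, i.e. the UTF-8 continuation bytes
var m = encodeURIComponent($newVal).match(/%[89ABab]/g);
// UTF-16 length plus continuation bytes gives the UTF-8 size for BMP text;
// a character outside the BMP has length 2 and 3 continuation bytes, so it is counted as 5 bytes instead of 4
var bytesLeft = totalBytes - ($newVal.length + (m ? m.length : 0));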

This seems to work well, but if someone pastes in a large chunk of data, I'd like to slice the input so that only the first 200 bytes are shown. In pseudocode, I imagine something like this:

$newText = substrBytes($string, 0, 200);

Any help or guidance would be much appreciated.

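Edit: Everything going on here is UTF-8, by the way :)

Edit 2: I'm aware that I could loop over every character and evaluate it; I suppose I was hoping there might be something that handles this a little more gracefully.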

Thanks!

Source

2012-04-18 Slazlaa

Why are you processing your input as bytes and not as text? http://www.w3schools.com/jsref/jsref_obj_string.asp – jazzytomato

I may be wrong, but I get the impression that you would need some kind of iconv-like conversion between character encodings. That doesn't sound easy. –

The system this connects to requires the text payload to be 200 bytes or less. – Slazlaa

A Google search turned up the following. Try it out; there is an input box on the page. I'm copying the code here because SO prefers definitive answers over links, but the credit goes to McDowell.

/** 
* codePoint - an integer containing a Unicode code point 
* return - the number of bytes required to store the code point in UTF-8 
*/ 
function utf8Len(codePoint) { 
    if(codePoint >= 0xD800 && codePoint <= 0xDFFF)
        throw new Error("Illegal argument: "+codePoint);
    if(codePoint < 0) throw new Error("Illegal argument: "+codePoint); 
    if(codePoint <= 0x7F) return 1; 
    if(codePoint <= 0x7FF) return 2; 
    if(codePoint <= 0xFFFF) return 3; 
    if(codePoint <= 0x1FFFFF) return 4; 
    if(codePoint <= 0x3FFFFFF) return 5; 
    if(codePoint <= 0x7FFFFFFF) return 6; 
    throw new Error("Illegal argument: "+codePoint); 
} 

function isHighSurrogate(codeUnit) { 
    return codeUnit >= 0xD800 && codeUnit <= 0xDBFF; 
} 

function isLowSurrogate(codeUnit) { 
    return codeUnit >= 0xDC00 && codeUnit <= 0xDFFF; 
} 

/** 
* Transforms UTF-16 surrogate pairs to a code point. 
* See RFC2781 
*/ 
function toCodepoint(highCodeUnit, lowCodeUnit) { 
    if(!isHighSurrogate(highCodeUnit)) throw new Error("Illegal argument: "+highCodeUnit); 
    if(!isLowSurrogate(lowCodeUnit)) throw new Error("Illegal argument: "+lowCodeUnit); 
    highCodeUnit = (0x3FF & highCodeUnit) << 10; 
    var u = highCodeUnit | (0x3FF & lowCodeUnit); 
    return u + 0x10000; 
} 

/** 
* Counts the length in bytes of a string when encoded as UTF-8. 
* str - a string 
* return - the length as an integer 
*/ 
function utf8ByteCount(str) { 
    var count = 0; 
    for(var i=0; i<str.length; i++) {
        var ch = str.charCodeAt(i);
        if(isHighSurrogate(ch)) {
            var high = ch;
            var low = str.charCodeAt(++i);
            count += utf8Len(toCodepoint(high, low));
        } else {
            count += utf8Len(ch);
        }
    }
    return count; 
}

Source

2012-04-18 12:03:22

This snippet is very interesting, but the result seems to vary with the encoding of the file that holds the source code. Still, it is probably great for forms. –

This code snippet doesn't actually have the slicing functionality I was looking for, but it handles the byte counting perfectly. Thanks! :) – Slazlaa
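As the comment above notes, the counting code does not slice. A minimal sketch of a slicing counterpart follows, assuming a runtime with ES2015 string iteration (`for...of` and `codePointAt`). The names `utf8LenOfCodePoint` and `sliceUtf8Bytes` are hypothetical and do not come from the thread.

```javascript
// Hypothetical helper: bytes needed to store one code point in UTF-8.
function utf8LenOfCodePoint(cp) {
    if (cp <= 0x7F) return 1;
    if (cp <= 0x7FF) return 2;
    if (cp <= 0xFFFF) return 3;
    return 4;
}

// Hypothetical substrBytes-style function: keeps the longest prefix made of
// whole code points whose UTF-8 encoding fits in maxBytes.
function sliceUtf8Bytes(str, maxBytes) {
    var bytes = 0;
    var end = 0; // end index measured in UTF-16 code units
    for (var ch of str) { // string iteration yields whole code points
        var size = utf8LenOfCodePoint(ch.codePointAt(0));
        if (bytes + size > maxBytes) break;
        bytes += size;
        end += ch.length; // 1 for a BMP character, 2 for a surrogate pair
    }
    return str.slice(0, end);
}
```

For example, `sliceUtf8Bytes("a\u20acb", 3)` keeps only `"a"`, because the euro sign needs three bytes of its own and would push the total to four.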

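JavaScript strings are represented internally as UTF-16, so every character actually takes two bytes (or four, for characters outside the Basic Multilingual Plane). So your question is really more like "get the byte length of str in UTF-8".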

You hardly want half of a symbol, so the result may be cut at 198 or 199 bytes.
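To make the difference between UTF-16 length and UTF-8 size concrete, here is a small sketch that assumes the standard `TextEncoder` API is available (modern browsers and Node.js). The helper name `encodedByteLength` is hypothetical.

```javascript
// TextEncoder always encodes to UTF-8.
function encodedByteLength(str) {
    return new TextEncoder().encode(str).length;
}

// "a", e-acute, euro sign, and an emoji outside the BMP
var samples = ["a", "\u00e9", "\u20ac", "\ud83d\ude00"];
var report = samples.map(function (s) {
    return { text: s, utf16Units: s.length, utf8Bytes: encodedByteLength(s) };
});
console.log(report);
```

The four samples have `length` 1, 1, 1 and 2, but occupy 1, 2, 3 and 4 bytes in UTF-8, which is why a 200-byte limit cannot be enforced through `length` alone.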

Here are two different solutions:

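// direct byte size counting
function cutInUTF8(str, n) {
    // a string that fits in n bytes has at most n code units
    var len = Math.min(n, str.length);
    var i, cs, c = 0, bytes = 0;
    for (i = 0; i < len; i++) {
        c = str.charCodeAt(i);
        cs = 1;                  // U+0000..U+007F: 1 byte
        if (c >= 128) cs++;      // U+0080..U+07FF: 2 bytes
        if (c >= 2048) cs++;     // U+0800..U+FFFF: 3 bytes
        // surrogate halves are counted as 3 bytes each, so a pair is
        // overestimated (6 instead of 4) and may be split by the cut
        if (n < (bytes += cs)) break;
    }
    return str.substr(0, i);
}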

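// using internal functions, but is not very fast due to try/catch
function cutInUTF8(str, n) {
    // unescape(encodeURIComponent(str)) yields one character per UTF-8 byte
    var encoded = unescape(encodeURIComponent(str)).substr(0, n);
    while (true) {
        try {
            // throws URIError while the cut leaves an incomplete sequence
            str = decodeURIComponent(escape(encoded));
            return str;
        } catch(e) {
            // drop the last byte and retry
            encoded = encoded.substr(0, encoded.length-1);
        }
    }
}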
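A further sketch, not part of the answer above: where the standard `TextEncoder` and `TextDecoder` APIs are available, the same byte-level cut can be made without try/catch by backing up past UTF-8 continuation bytes. The name `truncateUtf8` is hypothetical.

```javascript
// Hypothetical helper: returns the longest prefix of str whose UTF-8
// encoding is at most maxBytes long, without splitting a character.
function truncateUtf8(str, maxBytes) {
    var bytes = new TextEncoder().encode(str);
    if (bytes.length <= maxBytes) return str;
    var end = Math.max(0, maxBytes);
    // Continuation bytes have the form 10xxxxxx. If the first excluded byte
    // is one of them, the cut falls inside a character, so step back until
    // `end` points at the first byte of a character.
    while (end > 0 && (bytes[end] & 0xC0) === 0x80) end--;
    return new TextDecoder().decode(bytes.subarray(0, end));
}
```

With a 3-byte budget, `truncateUtf8("a\u20acb", 3)` returns `"a"`: the cut would land inside the euro sign, so the whole character is dropped, which matches the "198 or 199 bytes" behaviour described earlier.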

Source

2012-04-18 12:04:28 kirilloid

The first function, 'if (n

Fixed both functions – kirilloid

These are great answers. If I could mark both of them as correct, I would! Thanks! – Slazlaa
