Most efficient way to parse this scripting language

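I'm implementing an interpreter for a long-obsolete text editor's scripting language, and I'm having some trouble getting the lexer to work properly.

Here is an example of the problematic part of the language: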

T 
L /LOCATE ME/ 
C /LOCATE ME/CHANGED ME/ * * 
C ;CHANGED ME;CHANGED ME AGAIN; 1 *

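The / characters seem to quote strings, and they also act as the delimiter for the C (CHANGE) command in a sed-type syntax, although any character can be used as the delimiter.

I've implemented about half of the most common commands so far, up to now just using parse_tokens(line.split()). It was quick and dirty, but it worked surprisingly well.

To avoid writing my own lexer, I tried shlex.

It works pretty well, except for the CHANGE cases: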

import shlex 

def shlex_test(cmd_str): 
    lex = shlex.shlex(cmd_str) 
    lex.quotes = '/' 
    return list(lex) 

print(shlex_test('L /spaced string/')) 
# OK! gives: ['L', '/spaced string/'] 

print(shlex_test('C /spaced string/another string/ * *')) 
# gives : ['C', '/spaced string/', 'another', 'string/', '*', '*'] 
# desired : any format that doesn't split on a space between /'s 

print(shlex_test('C ;a b;b a;')) 
# gives : ['C', ';', 'a', 'b', ';', 'b', 'a', ';']
# desired : same format as CHANGE command above

Does anyone know a simple way to accomplish this (with shlex or otherwise)?

EDIT:

If it helps, here is the CHANGE command syntax as given in the help file:

''' 
C [/stg1/stg2/ [n|n m]] 

    The CHANGE command replaces the m-th occurrence of "stg1" with "stg2" 
for the next n lines. The default value for m and n is 1.'''

The X and Y commands are similarly difficult to tokenize:

''' 
X [/command/[command/[...]]n] 
Y [/command/[command/[...]]n] 

    The X and Y commands allow the execution of several commands contained 
in one command. To define an X or Y "command string", enter X (or Y) 
followed by a space, then individual commands, each separated by a 
delimiter (e.g. a period "."). An unlimited number of commands may be 
placed in the X or Y command string. Once the command string has been 
defined, entering X (or Y) followed optionally by a count n will execute 
the defined command string n times. If n is not specified, it will 
default to 1.'''
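
To illustrate the X/Y syntax quoted above, here is a rough sketch of one way such a line might be split. The function name parse_xy is made up for this example, and it assumes that the delimiter is the first non-blank character after the command letter, that the delimiter never appears inside an individual command, and that a purely numeric argument is a repeat count. The help text does not say whether a count may follow a definition on the same line, so that part is a guess.

```python
def parse_xy(cmd):
    """Split an X or Y line into (name, commands, count).

    Hypothetical helper: 'commands' is empty when the line only asks to
    run the previously defined command string.
    """
    name, _, rest = cmd.strip().partition(' ')
    rest = rest.strip()
    if not rest or rest.isdigit():
        # 'X' or 'X 3': execute the stored command string n times
        return name, [], int(rest or 1)

    delim = rest[0]                      # assumed: first non-blank character
    *commands, tail = rest[1:].split(delim)

    count = 1
    if tail.strip().isdigit():
        count = int(tail)                # assumed: trailing count after the definition
    elif tail.strip():
        commands.append(tail)            # last command had no closing delimiter

    # Drop empty pieces produced by doubled or trailing delimiters
    return name, [c for c in commands if c.strip()], count


print(parse_xy('X .L /LOCATE ME/.C /a b/b a/ 1 *.'))
print(parse_xy('Y 3'))
```

Each returned command would still have to be tokenized on its own, and a command string whose delimiter is a digit would be misread as a count.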

Source

2012-07-19 Robbie Rosati

Do you have access to a language definition? If so, a quote of the relevant parts would probably be helpful to all of us. – Marcin

@Marcin I added some relevant information from the help file. It's all the documentation there is. –

I don't know 'shlex', but I think regex ([re](http://docs.python.org/library/re.html)) could be useful too. – machaku
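
A regex along these lines might work for the CHANGE command. This is only a sketch: the names CHANGE_RE and parse_change are invented for the example, and it assumes the delimiter is a single non-blank character that does not occur inside either string. The backreference \2 makes the pattern reuse whatever delimiter was found first.

```python
import re

# Hypothetical pattern: the letter C, any non-blank delimiter, two fields
# each closed by that same delimiter (backreference \2), then whatever
# optional arguments follow.
CHANGE_RE = re.compile(r'^\s*(C)\s+(\S)(.*?)\2(.*?)\2\s*(.*)$')


def parse_change(cmd):
    """Return (stg1, stg2, args) for a CHANGE line, or None if it doesn't match."""
    match = CHANGE_RE.match(cmd)
    if match is None:
        return None
    _, _, stg1, stg2, args = match.groups()
    return stg1, stg2, args.split()


print(parse_change('C /spaced string/another string/ * *'))
print(parse_change('C ;a b;b a;'))
```

A bare C with no arguments does not match this pattern and would need to be handled separately.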

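The problem is probably that / is not a quote character here, only a delimiter. I'm guessing that the third character is always the one that defines the delimiter. Also, you don't need the / or ; in the output, do you?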

I just did the following, using split, for the L and C command cases:

>>> def parse(cmd):
...  delim = cmd[2]
...  return cmd.split(delim)
...
>>> c_cmd = "C /LOCATE ME/CHANGED ME/ * *"
>>> parse(c_cmd)
['C ', 'LOCATE ME', 'CHANGED ME', ' * *']

>>> c_cmd2 = "C ;a b;b a;"
>>> parse(c_cmd2)
['C ', 'a b', 'b a', '']

>>> l_cmd = "L /spaced string/"
>>> parse(l_cmd)
['L ', 'spaced string', '']

For the optional " * *" part, you could use split(" ") on the last list element:

>>> parse(c_cmd)[-1].split(" ") 
['', '*', '*']

Source

2012-07-19 21:19:08 sevenforce

Unfortunately, it's not *always* the third character, but I'll give this approach a try, thanks. –
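
If the delimiter is not at a fixed position, one possible variation on the split approach is to take the first non-blank character after the command name as the delimiter. The sketch below assumes exactly that, plus that the delimiter never occurs inside a field; split_delimited is a made-up name, not something from the editor or from shlex.

```python
def split_delimited(cmd):
    """Split a command line into (name, fields, trailing_args).

    Assumptions (not confirmed by the help file): the command name is the
    first whitespace-separated word, and the delimiter is the first
    non-blank character that follows it.
    """
    name, _, rest = cmd.strip().partition(' ')
    rest = rest.lstrip()
    if not rest:
        return name, [], []              # bare command such as 'T'

    delim = rest[0]
    # Everything before the last delimiter is a field; what follows it
    # holds the optional arguments (e.g. '* *' or '1 *').
    *fields, tail = rest[1:].split(delim)
    return name, fields, tail.split()


for line in ('T',
             'L /LOCATE ME/',
             'C /LOCATE ME/CHANGED ME/ * *',
             'C ;CHANGED ME;CHANGED ME AGAIN; 1 *'):
    print(split_delimited(line))
```

The caller still has to know which commands take delimited strings: for something like "X 3" this would treat "3" as the delimiter, and an unterminated field ends up in the trailing arguments.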
