Python - How to split text input into separate elements

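The input has inconsistent line breaks, so the newline character cannot be used as a delimiter. Incoming text has the following form:

IDNumber FirstName LastName Score Letter Location

IDNumber: 9 digits

Score: 0-100

Letter: A or B

Location: anything from a state abbreviation to a full city and state name. This is optional.

Example: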

123456789 John Doe 90 A New York City 987654321 
Jane Doe 70 B CAL 432167895 John 

Cena 60 B FL 473829105 Donald Trump 70 E 
098743215 Bernie Sanders 92 A AR

The elements should look like this:

123456789 John Doe 90 A New York City 
987654321 Jane Doe 70 B CAL 
432167895 John Cena 60 B FL 
473829105 Donald Trump 70 E 
098743215 Bernie Sanders 92 A AR

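I need to access each element individually for each person. So for the John Cena object, I need to be able to access ID: 432167895, first name: John, last name: Cena, B or A: B. The optional location, when present, will also be part of the input.

Edit: I am not allowed to import modules such as regular expressions.

Source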

2017-04-19 Jackson Blankenship

If the input is a string, I would start by [splitting the string on whitespace](http://stackoverflow.com/questions/8113782/split-string-on-whitespace-in-python).
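
A minimal sketch of why that is a reasonable first step: called with no arguments, `str.split()` treats any run of whitespace, newlines included, as a single separator, so the inconsistent line breaks stop mattering. The variable names are illustrative only.

```python
# Sample fragment with the same kind of inconsistent line breaks as the question
text = "432167895 John \n\nCena 60 B FL 473829105 Donald Trump 70 E \n098743215"

# With no arguments, split() splits on any run of whitespace (spaces, tabs, newlines)
tokens = text.split()
print(tokens)
```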

Here is an idea that produced these results.

text = "123456789 John Doe 90 A New York City 987654321 Jane Doe 70 B CAL 473829105 Donald Trump 70 E 098743215 Bernie Sanders 92 A AR"

# Split on any whitespace (spaces and newlines alike)
tokens = text.split()

# Store the records in a dictionary; this could then be dumped to a JSON file
data = {'output': []}
end = len(tokens)

i = 0

while i < end:
    tmp = {}
    tmp['id'] = tokens[i]
    i += 1
    tmp['fname'] = tokens[i]
    i += 1
    tmp['lname'] = tokens[i]
    i += 1
    tmp['score'] = tokens[i]
    i += 1
    tmp['letter'] = tokens[i]
    i += 1
    # Collect location words until the next all-digit token (the next ID)
    # or the end of the tokens; the location stays empty if none was given
    location = []
    while i < end and not tokens[i].isdigit():
        location.append(tokens[i])
        i += 1

    tmp['location'] = " ".join(location)
    data['output'].append(tmp)

print(data)

Source

2017-04-19 21:54:41

This works perfectly except when no location is specified! Do you know how to fix that? The location element is optional.

If there is nothing there, just store an empty string.
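
The empty-string suggestion above can be sketched without any imports, which also fits the constraint in the question. This is only a sketch under stated assumptions: the function name `parse_records` and the dictionary keys are hypothetical, the input is assumed to be well formed (every record has all five fixed fields), names are assumed to be exactly two words, and a 9-digit all-numeric token is assumed to mark the start of the next record.

```python
def parse_records(text):
    """Parse whitespace-separated records into a list of dicts (no imports needed)."""
    tokens = text.split()
    records = []
    i = 0
    while i < len(tokens):
        # Each record starts with five fixed fields
        record = {
            'id': tokens[i],
            'fname': tokens[i + 1],
            'lname': tokens[i + 2],
            'score': int(tokens[i + 3]),
            'letter': tokens[i + 4],
        }
        i += 5
        # Assumption: a 9-digit, all-numeric token starts the next record;
        # everything before it belongs to the optional location
        location = []
        while i < len(tokens) and not (len(tokens[i]) == 9 and tokens[i].isdigit()):
            location.append(tokens[i])
            i += 1
        # Empty string when no location was given
        record['location'] = " ".join(location)
        records.append(record)
    return records


text = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

people = parse_records(text)
cena = people[2]
print(cena['id'], cena['fname'], cena['lname'], cena['letter'])
```

Checking the token length as well as `isdigit()` keeps a purely numeric location word (for example a ZIP code) from being mistaken for the next ID, as long as it is not exactly nine digits long.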

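You could use a regular expression that requires each record to start with a 9-digit number, takes name words together where necessary, and skips the location:

import re

res = re.findall(r"(\d{9})\s+(\S*)\s+(\S*(?:\s+\D\S*)*)\s+(\d+)\s+(\S*)", data)

The result is: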

[('123456789', 'John', 'Doe', '90', 'A'), 
('987654321', 'Jane', 'Doe', '70', 'B'), 
('432167895', 'John', 'Cena', '60', 'B'), 
('473829105', 'Donald', 'Trump', '70', 'E'), 
('098743215', 'Bernie', 'Sanders', '92', 'A')]
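
To get from these tuples to per-person field access, one option is to zip each tuple with field names. The sketch below reuses the pattern from this answer unchanged; the key names and the `by_id` lookup are illustrative assumptions, not part of the original answer.

```python
import re

data = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

pattern = r"(\d{9})\s+(\S*)\s+(\S*(?:\s+\D\S*)*)\s+(\d+)\s+(\S*)"

# Hypothetical field names, one per capture group
fields = ('id', 'fname', 'lname', 'score', 'letter')
people = [dict(zip(fields, match)) for match in re.findall(pattern, data)]

# Index the records by ID so a single person can be looked up directly
by_id = {person['id']: person for person in people}
print(by_id['432167895']['lname'])
```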

Source

2017-04-19 21:17:14 trincot

Since whitespace is not useful for identifying where to split, I would go straight for a regex:

import re 

input_string = """123456789 John Doe 90 A New York City 987654321 
Jane Doe 70 B CAL 432167895 John 

Cena 60 B FL 473829105 Donald Trump 70 E 
098743215 Bernie Sanders 92 A AR""" 

search_string=re.compile(r"([0-9]{9})\W+([a-zA-Z ]+)\W+([a-zA-Z ]+)\W+([0-9]{1,3})\W+([AB])\W+([a-zA-Z ]+)\W+") 
person_list = re.findall(search_string, input_string)

This yields:

[('123456789', 'John', 'Doe', '90', 'A', 'New York City'), 
('987654321', 'Jane', 'Doe', '70', 'B', 'CAL'), 
('432167895', 'John', 'Cena', '60', 'B', 'FL')]

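Explanation of the groups in the regex:

ID: 9 digits (followed by at least one whitespace character)
Name: two separate groups of letters separated by at least one whitespace character (followed by at least one whitespace character)
Score: 1, 2, or 3 digits (followed by at least one whitespace character)
Letter: A or B (followed by at least one whitespace character)
Location: a group of letters (followed by at least one whitespace character)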
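
Because the pattern above requires the letter to be A or B and requires a location followed by a separator, it appears to skip the record with letter E and no location, as well as the final record, which is why only three tuples are listed. A sketch of a relaxed variant follows; the pattern and the name `record_re` are assumptions for illustration, not the answer author's code. It accepts any single capital letter and treats the location as an optional run of alphabetic words.

```python
import re

input_string = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

# Hypothetical variant: any capital letter as the grade, and the location
# is zero or more alphabetic words (so it may be empty)
record_re = re.compile(
    r"(\d{9})\s+([A-Za-z]+)\s+([A-Za-z]+)\s+(\d{1,3})\s+([A-Z])\b((?:\s+[A-Za-z]+)*)"
)

# Normalize the captured location: strip it and collapse internal whitespace
person_list = [
    (pid, first, last, score, letter, " ".join(location.split()))
    for pid, first, last, score, letter, location in record_re.findall(input_string)
]

for person in person_list:
    print(person)
```

Locations containing digits or punctuation (for example "St. Louis") would need a wider character class than `[A-Za-z]`.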

Source

2017-04-19 21:20:40

Since you know that each "record" starts with an ID number that is 9 digits long, try splitting on the 9-digit ID number.

# Assuming your file is read in as a string s:
import re

# Split on whitespace (spaces or newlines) that precedes a 9-digit ID
records = re.split(r'\s+(?=[0-9]{9}\b)', s.strip())

# record_locator will end up holding your records as:
# {'<full name>': {'ID': <ID value>, 'FirstName': <FirstName value>, 'LastName': <LastName value>, 'Letter': <Letter value>}, '<full name 2>': {...}, ...}
record_locator = {}

field_names = ['ID', 'FirstName', 'LastName', 'Letter']

# Get the individual records and store their values:
for record in records:

    # split() with no arguments also copes with newlines inside a record
    values = record.split()[:5]

    # Discard the int after the name, e.g. 90 in the first record
    del values[3]

    # Create a new entry keyed by the full name. This will overwrite entries with the same name, so you might want to use a unique ID instead
    record_locator[values[1] + ' ' + values[2]] = dict(zip(field_names, values))

To access the information afterwards:

print(record_locator['John Doe']['ID'])  # 123456789

Source

2017-04-19 21:26:00 sgrg

I think trying to split on the 9-digit number may be the best option. There is probably a more elegant way to do this, but with the example string as input, the following

import re 

with open('data.txt') as f: 
    data = f.read() 
    results = re.split(r'(\d{9}[\s\S]*?(?=[0-9]{9}))', data) 
    results = list(filter(None, results)) 
    print(results)

gives me

['123456789 John Doe 90 A New York City ', '987654321\nJane Doe 70 B CAL ', '432167895 John\n\nCena 60 B FL ', '473829105 Donald Trump 70 E\n', '098743215 Bernie Sanders 92 A AR']
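
Each chunk in that list still needs to be broken into fields. A minimal sketch of that follow-up step, assuming two-word names and treating everything after the fifth token as the optional location (the key names are illustrative):

```python
# Chunks as produced by the split above
results = ['123456789 John Doe 90 A New York City ', '987654321\nJane Doe 70 B CAL ',
           '432167895 John\n\nCena 60 B FL ', '473829105 Donald Trump 70 E\n',
           '098743215 Bernie Sanders 92 A AR']

fields = ('id', 'fname', 'lname', 'score', 'letter')
people = []
for chunk in results:
    # split() with no arguments ignores the stray newlines and trailing spaces
    parts = chunk.split()
    person = dict(zip(fields, parts[:5]))
    # Anything left over is the location; empty string if there is none
    person['location'] = " ".join(parts[5:])
    people.append(person)

print(people[2])
```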

Source

2017-04-19 21:31:32 davidejones
