2017-03-28 10 views

How to put translation corpus into different files

I want to process a one-line translation corpus between Japanese and Chinese that looks like this:

JST_JC_ENVI-abst-06A0281759-par1-sen1 ||| C&D管理施設の高度化 ||| C&D管理设施的高度化JST_JC_ENVI-abst-06A0281759-par1-sen2 ||| メーンのポートランドはRiversideリサイクリング施設(RRF)を所有しているが,建設及び解体(C&D)ごみの埋立地に立地している。 ||| 缅因州的波特兰拥有Riverside循环使用设施(RRF),但其却位置选定于建设及解体(C&D)垃圾的填埋地。JST_JC_ENVI-abst-06A0281759-par1-sen3 ||| この施設はかさばる廃棄物,住民の出す葉やC&Dごみを受け入れているが,その最近の作業状況を紹介した。 ||| 该设施接受体积大的废弃物、居民投弃的叶子或C&D垃圾,本文介绍了该设施最近的作业情况。

Each Japanese/Chinese sentence pair begins with a prefix string of the form JST_JC_ENVI-abstXXXXXXXX, and the fields are separated by |||.

So my question is: how can I delete all of the "JST_JC_ENVI-abstXXXXXXXX" prefix strings and write the Chinese sentences to chinese.txt and the Japanese sentences to japanese.txt, one sentence per line?

Thank you.

Answers


First, read the file and split its contents on whitespace, writing one token per line.

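# -*- coding: utf-8 -*-

import io

# io.open with an explicit encoding works on both Python 2 and 3,
# so the reload(sys)/setdefaultencoding hack is not needed.
with io.open('dev.txt', 'r', encoding='utf-8') as infile, \
     io.open('dev-mid.txt', 'w', encoding='utf-8') as outfile1:
    # Split the whole corpus on whitespace and write one token per line.
    for e in infile.read().split():
        outfile1.write(e + u'\n')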

Then use Word to delete the spaces and the repeated prefix strings in dev-mid.txt.

Finally,

import io

# Odd-numbered lines go to dev-in.txt, even-numbered lines to dev-out.txt.
with io.open('dev-mid.txt', 'r', encoding='utf-8') as infile, \
     io.open('dev-in.txt', 'w', encoding='utf-8') as outfile1, \
     io.open('dev-out.txt', 'w', encoding='utf-8') as outfile2:
    for i, line in enumerate(infile, start=1):
        if i % 2 == 1:
            outfile1.write(line)
        else:
            outfile2.write(line)

This separates the odd and even rows: dev-in.txt contains the Japanese and dev-out.txt contains the Chinese. :-D
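
As an alternative that avoids the manual cleanup step, the whole task could probably be done in one pass with a regular expression. The following is only a sketch: the ID pattern is an assumption inferred from the three sample records in the question (JST_JC_ENVI-abst-<code>-par<N>-sen<N>), and the function names `split_corpus` and `split_files` are made up for illustration. It also assumes that neither the sentences nor the IDs contain the ||| separator.

```python
# -*- coding: utf-8 -*-
import io
import re

# Assumed ID shape, inferred from the sample: JST_JC_ENVI-abst-06A0281759-par1-sen1
ID_PATTERN = re.compile(r'JST_JC_ENVI-abst-[0-9A-Za-z]+-par[0-9]+-sen[0-9]+')


def split_corpus(text):
    """Return (japanese, chinese) lists parsed from the concatenated corpus text."""
    japanese, chinese = [], []
    # Splitting on the ID leaves one " ||| ja ||| zh" chunk per sentence pair.
    for chunk in ID_PATTERN.split(text):
        fields = [field.strip() for field in chunk.split(u'|||')]
        if len(fields) != 3:
            continue  # skip the empty chunk before the first ID and malformed records
        _, ja, zh = fields
        japanese.append(ja)
        chinese.append(zh)
    return japanese, chinese


def split_files(in_path, ja_path, zh_path):
    """Read the corpus from in_path and write one sentence per line to each output."""
    with io.open(in_path, 'r', encoding='utf-8') as infile:
        japanese, chinese = split_corpus(infile.read())
    with io.open(ja_path, 'w', encoding='utf-8') as ja_file:
        ja_file.write(u'\n'.join(japanese) + u'\n')
    with io.open(zh_path, 'w', encoding='utf-8') as zh_file:
        zh_file.write(u'\n'.join(chinese) + u'\n')
```

With the file names used in this thread, the call would be `split_files('dev.txt', 'japanese.txt', 'chinese.txt')`. Splitting on the ID rather than on whitespace keeps sentences that contain spaces intact, and it copes with records that run together on a single line as in the sample.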