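Creating a dictionary from two functions (Python)

I need to build a dictionary from two functions (ZoekAccesieCode and ZoekOrganisme). ZoekAccesieCode returns strings like "Q6GZX2", and ZoekOrganisme returns strings like "Frog virus 3 (isolate Goorha)". The ZoekAccesieCode results should be the keys and the ZoekOrganisme results the values. Here is my code: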
import re

# Read the whole results file into memory.
with open("ploop.txt") as file:
    text = file.read()

def main():
    hits = VindHits()
    accesie = ZoekAccesieCode(hits)
    organisme = ZoekOrganisme(hits, accesie)
    MaakDict(accesie, organisme)

def VindHits():
    # Records are separated by blank lines; the first block is the header.
    eiwitten = text.split("\n\n")[1:]
    eiwitHits = []
    for eiwit in eiwitten:
        if re.search(r"[AG].{4}GK[ST]", eiwit):
            eiwitHits.append(eiwit)
    return eiwitHits

def ZoekAccesieCode(hits):
    for eiwit in hits:
        accesieCode = re.findall(r">sp\|(.{6})", eiwit)[0]
        return accesieCode

def ZoekOrganisme(hits, accesie):
    for eiwit in hits:
        organisme = re.findall(r"\n.+?\[(.+?)\]", eiwit)[0]
        return organisme

def MaakDict(accesie, organisme):
    pass  # TODO: build the dictionary here

main()
Some sample data from the file:
Hits for PS00017|ATP_GTP_A (pattern) ATP/GTP-binding site motif A (P-loop) : [occurs frequently]
Pattern: [AG]-x(4)-G-K-[ST]
Approximate number of expected random matches in ~ 100'000 sequences (50'000'000 residues): 3371
>sp|Q6GZX2|003R_FRG3G (438 aa)
Uncharacterized protein 3R. [Frog virus 3 (isolate Goorha) (FV-3)]
MARPLLGKTSSVRRRLESLSACSIFFFLRKFCQKMASLVFLNSPVYQMSNILLTERRQVDRAMGGSDDDGVMVVALSPSD
FKTVLGSALLAVERDMVHVVPKYLQTPGILHDMLVLLTPIFGEALSVDMSGATDVMVQQIATAGFVDVDPLHSSVSWKDN
VSCPVALLAVSNAVRTMMGQPCQVTLIIDVGTQNILRDLVNLPVEMSGDLQVMAYTKDPLGKVPAVGVSVFDSGSVQKGD
AHSVGAPDGLVSFHTHPVSSAVELNYHAGWPSNVDMSSLLTMKNLMHVVVAEEGLWTMARTLSMQRLTKVLTDAEKDVMR
AAAFNLFLPLNELRVMGTKDSNNKSLKTYFEVFETFTIGALMKHSGVTPTAFVDRRWLDNTIYHMGFIPWGRDMRFVVEY
DLDGTNPFLNTVPTLMSVKRKAKIQEMFDNMVSRMVTS
2 - 9: ArpllGKT
>sp|Q6GZX1|004R_FRG3G (60 aa)
Uncharacterized protein 004R. [Frog virus 3 (isolate Goorha) (FV-3)]
MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY
33 - 40: GyyydGKT
>sp|Q6GZW0|015R_FRG3G (322 aa)
Uncharacterized protein 015R. [Frog virus 3 (isolate Goorha) (FV-3)]
MEQVPIKEMRLSDLRPNNKSIDTDLGGTKLVVIGKPGSGKSTLIKALLDSKRHIIPCAVVISGSEEANGFYKGVVPDLFI
YHQFSPSIIDRIHRRQVKAKAEMGSKKSWLLVVIDDCMDNAKMFNDKEVRALFKNGRHWNVLVVIANQYVMDLTPDLRSS
VDGVFLFRENNVTYRDKTYANFASVVPKKLYPTVMETVCQNYRCMFIDNTKATDNWHDSVFWYKAPYSKSAVAPFGARSY
WKYACSKTGEEMPAVFDNVKILGDLLLKELPEAGEALVTYGGKDGPSDNEDGPSDDEDGPSDDEEGLSKDGVSEYYQSDL
DD
34 - 41: GkpgsGKS
>sp|P32234|128UP_DROME (368 aa)
GTP-binding protein 128up. [Drosophila melanogaster (Fruit fly)]
MSTILEKISAIESEMARTQKNKATSAHLGLLKAKLAKLRRELISPKGGGGGTGEAGFEVAKTGDARVGFVGFPSVGKSTL
LSNLAGVYSEVAAYEFTTLTTVPGCIKYKGAKIQLLDLPGIIEGAKDGKGRGRQVIAVARTCNLIFMVLDCLKPLGHKKL
LEHELEGFGIRLNKKPPNIYYKRKDKGGINLNSMVPQSELDTDLVKTILSEYKIHNADITLRYDATSDDLIDVIEGNRIY
IPCIYLLNKIDQISIEELDVIYKIPHCVPISAHHHWNFDDLLELMWEYLRLQRIYTKPKGQLPDYNSPVVLHNERTSIED
FCNKLHRSIAKEFKYALVWGSSVKHQPQKVGIEHVLNDEDVVQIVKKV
71 - 78: GfpsvGKS
>sp|P05080|194K_TRVSY (1707 aa)
Replicase large subunit. [Tobacco rattle virus (strain SYM)]
MANGNFKLSQLLNVDEMSAEQRSHFFDLMLTKPDCEIGQMMQRVVVDKVDDMIRERKTKDPVIVHEVLSQKEQNKLMEIY
PEFNIVFKDDKNMVHGFAAAERKLQALLLLDRVPALQEVDDIGGQWSFWVTRGEKRIHSCCPNLDIRDDQREISRQIFLT
AIGDQARSGKRQMSENELWMYDQFRKNIAAPNAVRCNNTYQGCTCRGFSDGKKKGAQYAIALHSLYDFKLKDLMATMVEK
KTKVVHAAMLFAPESMLVDEGPLPSVDGYYMKKNGKIYFGFEKDPSFSYIHDWEEYKKYLLGKPVSYQGNVFYFEPWQVR
GDTMLFSIYRIAGVPRRSLSSQEYYRRIYISRWENMVVVPIFDLVESTRELVKKDLFVEKQFMDKCLDYIARLSDQQLTI
SNVKSYLSSNNWVLFINGAAVKNKQSVDSRDLQLLAQTLLVKEQVARPVMRELREAILTETKPITSLTDVLGLISRKLWK
QFANKIAVGGFVGMVGTLIGFYPKKVLTWAKDTPNGPELCYENSHKTKVIVFLSVVYAIGGITLMRRDIRDGLVKKLCDM
FDIKRGAHVLDVENPCRYYEINDFFSSLYSASESGETVLPDLSEVKAKSDKLLQQKKEIADEFLSAKFSNYSGSSVRTSP
PSVVGSSRSGLGLLLEDSNVLTQARVGVSRKVDDEEIMEQFLSGLIDTEAEIDEVVSAFSAECERGETSGTKVLCKPLTP
PGFENVLPAVKPLVSKGKTVKRVDYFQVMGGERLPKRPVVSGDNSVDARREFLYYLDAERVAQNDEIMSLYRDYSRGVIR
TGGQNYPHGLGVWDVEMKNWCIRPVVTEHAYVFQPDKRMDDWSGYLEVAVWERGMLVNDFAVERMSDYVIVCDQTYLCNN
RLILDNLSALDLGPVNCSFELVDGVPGCGKSTMIVNSANPCVDVVLSTGRAATDDLIERFASKGFPCKLKRRVKTVDSFL
MHCVDGSLTGDVLHFDEALMAHAGMVYFCAQIAGAKRCICQGDQNQISFKPRVSQVDLRFSSLVGKFDIVTEKRETYRSP
ADVAAVLNKYYTGDVRTHNATANSMTVRKIVSKEQVSLKPGAQYITFLQSEKKELVNLLALRKVAAKVSTVHESQGETFK
DVVLVRTKPTDDSIARGREYLIVALSRHTQSLVYETVKEDDVSKEIRESAALTKAALARFFVTETVLXRFRSRFDVFRHH
EGPCAVPDSGTITDLEMWYDALFPGNSLRDSSLDGYLVATTDCNLRLDNVTIKSGNWKDKFAEKETFLKPVIRTAMPDKR
KTTQLESLLALQKRNQAAPDLQENVHATVLIEETMKKLKSVVYDVGKIRADPIVNRAQMERWWRNQSTAVQAKVVADVRE
LHEIDYSSYMYMIKSDVKPKTDLTPQFEYSALQTVVYHEKLINSLFGPIFKEINERKLDAMQPHFVFNTRMTSSDLNDRV
KFLNTEAAYDFVEIDMSKFDKSANRFHLQLQLEIYRLFGLDEWAAFLWEVSHTQTTVRDIQNGMMAHIWYQQKSGDADTY
NANSDRTLCALLSELPLEKAVMVTYGGDDSLIAFPRGTQFVDPCPKLATKWNFECKIFKYDVPMFCGKFLLKTSSCYEFV
PDPVKVLTKLGKKSIKDVQHLAEIYISLNDSNRALGNYMVVSKLSESVSDRYLYKGDSVHALCALWKHIKSFTALCTLFR
DENDKELNPAKVDWKKAQRAVSNFYDW
904 - 911: GvpgcGKS
>sp|P03589|1A_AMVLE (1126 aa)
Replication protein 1a. [Alfalfa mosaic virus (strain 425/isolate Leiden)]
MNADAQSTDASLSMREPLSHASIQEMLRRVVEKQAADDTTAIGKVFSEAGRAYAQDALPSDKGEVLKISFSLDATQQNIL
RANFPGRRTVFSNSSSSSHCFAAAHRLLETDFVYRCFGNTVDSIIDLGGNFVSHMKVKRHNVHCCCPILDARDGARLTER
ILSLKSYVRKHPEIVGEADYCMDTFQKCSRRADYAFAIHSTSDLDVGELACSLDQKGVMKFICTMMVDADMLIHNEGEIP
NFNVRWEIDRKKDLIHFDFIDEPNLGYSHRFSLLKHYLTYNAVDLGHAAYRIERKQDFGGVMVIDLTYSLGFVPKMPHSN
GRSCAWYNRVKGQMVVHTVNEGYYHHSYQTAVRRKVLVDKKVLTRVTEVAFRQFRPNADAHSAIQSIATMLSSSTNHTII
GGVTLISGKPLSPDDYIPVATTIYYRVKKLYNAIPEMLSLLDKGERLSTDAVLKGSEGPMWYSGPTFLSALDKVNVPGDF
VAKALLSLPKRDLKSLFSRSATSHSERTPVRDESPIRCTDGVFYPIRMLLKCLGSDKFESVTITDPRSNTETTVDLYQSF
QKKIETVFSFILGKIDGPSPLISDPVYFQSLEDVYYAEWHQGNAIDASNYARTLLDDIRKQKEESLKAKAKEVEDAQKLN
RAILQVHAYLEAHPDGGKIEGLGLSSQFIAKIPELAIPTPKPLPEFEKNAETGEILRINPHSDAILEAIDYLKSTSANSI
ITLNKLGDHCQWTTKGLDVVWAGDDKRRAFIPKKNTWVGPTARSYPLAKYERAMSKDGYVTLRWDGEVLDANCVRSLSQY
EIVFVDQSCVFASAEAIIPSLEKALGLEAHFSVTIVDGVAGCGKTTNIKQIARSSGRDVDLILTSNRSSADELKETIDCS
PLTKLHYIRTCDSYLMSASAVKAQRLIFDECFLQHAGLVYAAATLAGCSEVIGFGDTEQIPFVSRNPSFVFRHHKLTGKV
ERKLITWRSPADATYCLEKYFYKNKKPVKTNSRVLRSIEVVPINSPVSVERNTNALYLCHTQAEKAVLKAQTHLKGCDNI
FTTHEAQGKTFDNVYFCRLTRTSTSLATGRDPINGPCNGLVALSRHKKTFKYFTIAHDSDDVIYNACRDAGNTDDSILAR
SYNHNF
838 - 845: GvagcGKT
>sp|Q9AT00|TGD3_ARATH (345 aa)
Protein TRIGALACTOSYLDIACYLGLYCEROL 3, chloroplastic. [Arabidopsis thaliana (Mouse-ear cress)]
MLSLSCSSSSSSLLPPSLHYHGSSSVQSIVVPRRSLISFRRKVSCCCIAPPQNLDNDATKFDSLTKSGGGMCKERGLEND
SDVLIECRDVYKSFGEKHILKGVSFKIRHGEAVGVIGPSGTGKSTILKIMAGLLAPDKGEVYIRGKKRAGLISDEEISGL
RIGLVFQSAALFDSLSVRENVGFLLYERSKMSENQISELVTQTLAAVGLKGVENRLPSELSGGMKKRVALARSLIFDTTK
EVIEPEVLLYDEPTAGLDPIASTVVEDLIRSVHMTDEDAVGKPGKIASYLVVTHQHSTIQRAVDRLLFLYEGKIVWQGMT
HEFTTSTNPIVQQFATGSLDGPIRY
117 - 124: GpsgtGKS
Can anyone help me get the code right?
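I notice that both functions, `ZoekAccesieCode` and `ZoekOrganisme`, should return lists of the same length (otherwise you would need slicing to make them match), but each one returns only the first value from `re.findall`. Do you really want a single-element dictionary at the end?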
I assume you want these two functions to return a list of strings rather than a single string?
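Following up on the two comments above, here is a minimal sketch of what returning lists could look like. It is not necessarily what the asker had in mind. It makes a few assumptions that are not in the original post: the functions take the file text as a parameter instead of reading a global, `ZoekOrganisme` drops its unused `accesie` argument, and `SAMPLE` is an inline stand-in for `ploop.txt` built from two of the records shown in the question (first sequence line only).

```python
import re

# Assumption: inline stand-in for the contents of ploop.txt, using two
# records from the sample data (sequences cut to their first line).
SAMPLE = (
    "Hits for PS00017|ATP_GTP_A (pattern)\n"
    "Pattern: [AG]-x(4)-G-K-[ST]\n"
    "\n"
    ">sp|Q6GZX2|003R_FRG3G (438 aa)\n"
    "Uncharacterized protein 3R. [Frog virus 3 (isolate Goorha) (FV-3)]\n"
    "MARPLLGKTSSVRRRLESLSACSIFFFLRKFCQKMASLVFLNSPVYQMSNILLTERRQVDRAMGGSDDDGVMVVALSPSD\n"
    "2 - 9: ArpllGKT\n"
    "\n"
    ">sp|P32234|128UP_DROME (368 aa)\n"
    "GTP-binding protein 128up. [Drosophila melanogaster (Fruit fly)]\n"
    "MSTILEKISAIESEMARTQKNKATSAHLGLLKAKLAKLRRELISPKGGGGGTGEAGFEVAKTGDARVGFVGFPSVGKSTL\n"
    "71 - 78: GfpsvGKS\n"
)


def VindHits(text):
    # Split into blank-line separated records, skip the header block,
    # and keep only records that contain the P-loop motif.
    eiwitten = text.split("\n\n")[1:]
    return [eiwit for eiwit in eiwitten if re.search(r"[AG].{4}GK[ST]", eiwit)]


def ZoekAccesieCode(hits):
    # One accession code per hit, collected in a list.
    return [re.findall(r">sp\|(.{6})", eiwit)[0] for eiwit in hits]


def ZoekOrganisme(hits):
    # One organism name per hit, in the same order as the accession codes.
    return [re.findall(r"\n.+?\[(.+?)\]", eiwit)[0] for eiwit in hits]


def MaakDict(accesie, organisme):
    # Pair the two lists element by element: accession -> organism.
    return dict(zip(accesie, organisme))


hits = VindHits(SAMPLE)
resultaat = MaakDict(ZoekAccesieCode(hits), ZoekOrganisme(hits))
print(resultaat)
```

Because both lists are built from the same `hits` in the same order, `zip` pairs each accession code with its own organism. Note that `zip` silently stops at the shorter list, so if one regex could fail on some records it may be safer to extract both values in a single loop over `hits`.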