2017-01-19 19 views
0

How do I extract URLs from a text file in Python?

https://tse3.mm.bing.net///th?id=OIP.Mcbb568859281f5bc7a7f64d8c58d4895H1&pid=Api\ 
https:\\/\\/tse1.mm.bing.net\\/th?id=OIP.M7ff1f4e880bac2c244c0b6a286cee669o2&pid=Api\ 

I want output like the two lines above. This is the code I am using:

def get_net_target(page): 
    start_link=page.find("thumbnailUrl") 
    start_quote=page.find('"',start_link) 
    end_quote=page.find('"',start_quote+1) 
    url=page[start_quote+1:end_quote] 
    print url 

my_file = open("data.txt") 
page = my_file.read() 

print(get_net_target(page)) 

The input is a text file full of URLs and other text. I want to extract only the URLs that come right after

thumbnailUrl\": \ 

However, the code prints:

None

Here are a few lines of the data:

...

webSearchUrl\": \"https:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=RUc0BARkL2P78A5CI7XPWqhCYAA2XaQLP-fHGdfODEY&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dshoaibmalik%26id%3d97C5A1ECB43BCDC1B5739F49555CE0C75CEDF83F%26simid%3d607996336242885612&p=DevEx,5006.1\", \"thumbnailUrl\": \"https:\\/\\/tse2.mm.bing.net\\/th?id=OIP.Me19820ab68b4bcc7ec82756b2b5ecffbo1&pid=Api\", \"datePublished\": \"2011-07-08T12:00:00\", \"contentUrl\": \"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=gA9S9qCIF1jvD5yA4V9VOqfrJUxdW2_wyacSDR15Yc8&v=1&r=http%3a%2f%2fwww.forumpakistan.com%2fimages%2fcelebrity-profiles%2fShoaib-Malik-1.jpg&p=DevEx,5008.1\", \"hostPageUrl\": \"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=IODAmtxi3pYzDGhiJcJgCv0fWHEq8hlJauGxRW5o2c4&v=1&r=http%3a%2f%2fok-khan.blogspot.com%2f2011%2f07%2fshoaib-malik.html&p=DevEx,5007.1\", \"contentSize\": \"48445 B\", \"encodingFormat\": \"jpeg\", \"hostPageDisplayUrl\": \"ok-khan.blogspot.com\\/2011\\/07\\/shoaib-malik.html\", \"width\": 500, \"height\": 647, \"thumbnail\": {\"width\": 231, \"height\": 300}, \"imageInsightsToken\": \"ccid_4Zggq2i0*mid_97C5A1ECB43BCDC1B5739F49555CE0C75CEDF83F*simid_607996336242885612\", \"imageId\": \"97C5A1ECB43BCDC1B5739F49555CE0C75CEDF83F\", \"accentColor\": \"3A6491\"}, {\"name\": \"Pakistani Crickert Player: Shoaib Malik\", \"webSearchUrl\": \"https:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=4qc04BUbtNDwiCHco5m3IY_YFqKVaY2q8ZWhX-DvFQs&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dshoaibmalik%26id%3dF690295FD18526BA8225367169A0664405923A09%26simid%3d608039315980946676&p=DevEx,5012.1\", \"thumbnailUrl\": 
\"https:\\/\\/tse3.mm.bing.net\\/th?id=OIP.Mcbb568859281f5bc7a7f64d8c58d4895H1&pid=Api\", \"datePublished\": \"2012-12-24T12:00:00\", \"contentUrl\": \"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=9psh5pXKn2R_2Zn4-iMzpjDFePVuLSNVJhbVjf2uTI0&v=1&r=http%3a%2f%2fi1.tribune.com.pk%2fwp-content%2fuploads%2f2010%2f10%2fshoaib-malik-640x480.jpg&p=DevEx,5014.1\", \"hostPageUrl\": \"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=-cUvEUoDmZ1OAI-PVQc4MOfS-ELdt5Im521SJ2ZP4j8&v=1&r=http%3a%2f%2fpakistanicricketplayr44410.blogspot.com%2f2012%2f12%2fshoaib-malik.html&p=DevEx,5013.1\", \"contentSize\": \"51986 B\", \"encodingFormat\": \"jpeg\", \"hostPageDisplayUrl\": \"pakistanicricketplayr44410.blogspot.com\\/2012\\/12\\/shoaib-malik.html\", \"width\": 640, \"height\": 480, \"thumbnail\": {\"width\": 300, \"height\": 225}, \"imageInsightsToken\": \"ccid_y7VohZKB*mid_F690295FD18526BA8225367169A0664405923A09*simid_608039315980946676\", \"imageId\": \"F690295FD18526BA8225367169A0664405923A09\", \"accentColor\": \"98AE1D\"}, {\"name\": \"Pakistani Cricket Players: Shoaib Malik\", \"webSearchUrl\": \"https:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=n2Lkz5bg7h-AgbmZE4SnL-_AFBcCgc-_vaiVeAuC84s&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dshoaibmalik%26id%3d320A83F8A63DED3BD4B4EF926CAA3BE901F9DEA2%26simid%3d608028569977424814&p=DevEx,5018.1\", \"thumbnailUrl\": \"https:\\/\\/tse3.mm.bing.net\\/th?id=OIP.Mb6ca65eda578c80e71f4c3b3193c5b67H1&pid=Api\", \"datePublished\": \"2011-04-17T12:00:00\", \"contentUrl\": 
\"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=TwpcQHy-RdAJUStMisg6zBtjt_j60EStRFRAJS1D69Q&v=1&r=http%3a%2f%2fimages.teamtalk.com%2f08%2f10%2f800x600%2fShoaib-Malik_1264846.jpg&p=DevEx,5020.1\", \"hostPageUrl\": \"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=xICbhyFdmUBblBavcA3pXPdpbOa-1bJuBvP5H6Z0kms&v=1&r=http%3a%2f%2fcricketplayerspk.blogspot.com%2f2011%2f04%2fshoaib-malik.html&p=DevEx,5019.1\", \"contentSize\": \"51243 B\", \"encodingFormat\": \"jpeg\", \"hostPageDisplayUrl\": \"cricketplayerspk.blogspot.com\\/2011\\/04\\/shoaib-malik.html\", \"width\": 800, \"height\": 600, \"thumbnail\": {\"width\": 300, \"height\": 225}, \"imageInsightsToken\": \"ccid_tspl7aV4*mid_320A83F8A63DED3BD4B4EF926CAA3BE901F9DEA2*simid_608028569977424814\", \"imageId\": \"320A83F8A63DED3BD4B4EF926CAA3BE901F9DEA2\", \"accentColor\": \"416838\"}, {\"name\": \"Shoaib Malik in line for Test comeback after 5 years - Sports\", \"webSearchUrl\": \"https:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=7CIa0gvwncEquihLMmMIvtYAAUYZutf8EQr57d8EDO0&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dshoaibmalik%26id%3d8045A5C7203C2203C8238D9E00905FCB328BD4D9%26simid%3d608033376034882300&p=DevEx,5024.1\", \"thumbnailUrl\": \"https:\\/\\/tse2.mm.bing.net\\/th?id=OIP.M65fe5bf16283dc466e93650fbaef1205o1&pid=Api\", \"datePublished\": \"2015-10-06T04:07:00\", \"contentUrl\": \"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=F2RLPPSfrErnxq7OZt_3mbKbvpJITet7f_kGd90aKlg&v=1&r=http%3a%2f%2fimages.mid-day.com%2fimages%2f2015%2foct%2f6Shoaib-Malik-1.jpg&p=DevEx,5026.1\", \"hostPageUrl\": 
\"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=3V02TER99J6fm2eshh_cv4NCdJELV1DpI1pOmALtDMQ&v=1&r=http%3a%2f%2fwww.mid-day.com%2farticles%2fshoaib-malik-in-line-for-test-comeback-after-5-years%2f16586181&p=DevEx,5025.1\", \"contentSize\": \"119997 B\", \"encodingFormat\": \"jpeg\", \"hostPageDisplayUrl\": \"www.mid-day.com\\/articles\\/shoaib-malik-in-line-for-test-comeback...\", \"width\": 670, \"height\": 746, \"thumbnail\": {\"width\": 269, \"height\": 300}, \"imageInsightsToken\": \"ccid_Zf5b8WKD*mid_8045A5C7203C2203C8238D9E00905FCB328BD4D9*simid_608033376034882300\", \"imageId\": \"8045A5C7203C2203C8238D9E00905FCB328BD4D9\", \"accentColor\": \"304987\"}, {\"name\": \"Gallery > Cricketers > Shoaib Malik > Shoaib Malik high quality! Free ...\", \"webSearchUrl\": \"https:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=A9FD1ucKtYszoNQZ2KEhYMvgMwvJ6AA5d-DFInyr9I4&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dshoaibmalik%26id%3dB7AD00B57D67FD1664C7BBA404FF6E2679019517%26simid%3d608007657767896024&p=DevEx,5030.1\", \"thumbnailUrl\": \"https:\\/\\/tse3.mm.bing.net\\/th?id=OIP.M5d9fb4d528228cb5c8b9748bff10365bo1&pid=Api\", \"datePublished\": \"2013-05-18T00:44:00\", \"contentUrl\": \"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=7jwPNSK-kjHNAXQmqBqznMWCB3u4YPz0uHDFoJizw1U&v=1&r=http%3a%2f%2fpak101.com%2fgallery%2fCricketers%2fShoaib_Malik%2f2011%2f9%2f22%2fShoaib_Malik_Picture_9_xmnqf.jpg&p=DevEx,5032.1\", \"hostPageUrl\": \"http:\\/\\/www.bing.com\ 
+0

Please reformat your code and check your indentation. – mmenschig

+0

Please supply a few lines of the failing data file so that we can reproduce the problem. – Prune

+0

Nobody can verify your code without knowing what the input looks like. Please paste a few example lines from your data.txt. – Heri

Answers

0

This code shows two approaches. The first is roughly yours; the second is a simpler way that uses a regular expression.

The first approach is worth learning, but the crucial point is to keep track of your position within the string you are parsing.
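As a small illustration of keeping your place, here is a sketch on a made-up string. The helper name find_all is hypothetical and not part of the code in this thread. str.find accepts an optional start index, so each search can resume where the previous one ended instead of beginning again at index 0 and finding the same match every time.

```python
def find_all(text, sub):
    """Return the index of every non-overlapping occurrence of sub in text."""
    if not sub:
        raise ValueError("sub must not be empty")
    positions = []
    start = 0
    while True:
        pos = text.find(sub, start)   # search only from `start` onwards
        if pos == -1:                 # no further occurrence
            return positions
        positions.append(pos)
        start = pos + len(sub)        # remember where to resume


print(find_all("ab--ab--ab", "ab"))   # prints [0, 4, 8]
```

Without the start argument, every call would return 0 and the loop would never advance.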

import re

# In this non-raw string literal, \" becomes " and \\/ becomes \/ .
data = '''webSearchUrl\": \"https:\\/\\/w ... p:\\/\\/www.bing.com"'''
# Undo the escaping of forward slashes (\/ -> /).
data = data.replace('\\/', '/')

print('Using roughly your approach ...')

key = 'thumbnailUrl'
start = 0
while True:
    p = data.find(key, start)            # next occurrence of the key
    if p == -1:
        break
    q = data.find('http', p + len(key))  # start of the URL
    r = data.find('"', q)                # closing quote of the URL
    print(data[q:r])
    start = r                            # resume the search after this URL

print('Using a regular expression ...')

thumbnail_re = re.compile(r'thumbnailUrl":\s+"([^"]+)')
for match in thumbnail_re.findall(data):
    print(match)

The output is the same:

Using roughly your approach ... 
https://tse2.mm.bing.net/th?id=OIP.Me19820ab68b4bcc7ec82756b2b5ecffbo1&pid=Api 
https://tse3.mm.bing.net/th?id=OIP.Mcbb568859281f5bc7a7f64d8c58d4895H1&pid=Api 
https://tse3.mm.bing.net/th?id=OIP.Mb6ca65eda578c80e71f4c3b3193c5b67H1&pid=Api 
https://tse2.mm.bing.net/th?id=OIP.M65fe5bf16283dc466e93650fbaef1205o1&pid=Api 
https://tse3.mm.bing.net/th?id=OIP.M5d9fb4d528228cb5c8b9748bff10365bo1&pid=Api 
Using a regular expression ... 
https://tse2.mm.bing.net/th?id=OIP.Me19820ab68b4bcc7ec82756b2b5ecffbo1&pid=Api 
https://tse3.mm.bing.net/th?id=OIP.Mcbb568859281f5bc7a7f64d8c58d4895H1&pid=Api 
https://tse3.mm.bing.net/th?id=OIP.Mb6ca65eda578c80e71f4c3b3193c5b67H1&pid=Api 
https://tse2.mm.bing.net/th?id=OIP.M65fe5bf16283dc466e93650fbaef1205o1&pid=Api 
https://tse3.mm.bing.net/th?id=OIP.M5d9fb4d528228cb5c8b9748bff10365bo1&pid=Api 
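A side note on the None in the question: get_net_target prints a value itself but has no return statement, so print(get_net_target(page)) prints the function's implicit return value, which is None. Below is a minimal sketch of a variant that returns all matches instead. The name extract_thumbnail_urls and the inline sample are illustrative only. It assumes the file contains the backslashes literally, as in the excerpt in the question, and that no wanted URL contains a genuine backslash, so it simply strips them all before matching.

```python
import re

# Matches the key, its closing quote, the colon, and captures the quoted value.
THUMBNAIL_RE = re.compile(r'thumbnailUrl"\s*:\s*"([^"]+)')


def extract_thumbnail_urls(text):
    """Return every thumbnailUrl value found in text, in order of appearance."""
    # Assumption: backslashes occur only as escaping (\" and \\/), so
    # dropping them all leaves plain quotes and slashes.
    cleaned = text.replace("\\", "")
    return THUMBNAIL_RE.findall(cleaned)


# Illustrative sample with literal backslashes, modelled on the question's data.
sample = r'\"name\": \"x\", \"thumbnailUrl\": \"https:\\/\\/tse2.mm.bing.net\\/th?id=OIP.Me19820ab68b4bcc7ec82756b2b5ecffbo1&pid=Api\", \"width\": 500'

for url in extract_thumbnail_urls(sample):
    print(url)
```

To run it on the file from the question, read the text first (for example with open("data.txt") as f: page = f.read()) and pass page to the function. Returning a list rather than printing inside the function lets the caller decide what to do with the URLs.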
+0

Thanks, it works @Bill Bell – user7442628

+0

You're very welcome! –
