大文字の文字列の最初の一致インデックスを見つける最速の方法

私はPython 2.7.13でクイック検索アルゴリズムを実装しました。それは私の望むことをしますが、私は小さなパフォーマンスの問題があります。大文字の文字列の最初の一致インデックスを見つける最速の方法

私はHTML記事のテキストを持っていますが、一般的に5,000〜50,000文字ですが、300,000文字ほどの大きさにすることもできます。
私は特殊文字（é、à、ø、/ ...）とスペースを含むことができる「単語」のリストを持っています。通常は数百から数千語です。単語の長さは2〜256文字です。私は私が唯一の私が持っているもの、それぞれの単語

のための最初の一致を必要とするマッチ

のテキストにインデックスを必要とするHTMLタグ

に含まれたアイテムを無視する必要が

は、この実装である：

def find_indexes(text, words): 
    words_indexes = [] 
    found_words = [] 
    authorized_characters = [u' ', u'.', u':', u';', u'?', u'!', u'¿', u'¡', u'…', u'(', u')'] 

    text_length = len(text) 

    for j, word in enumerate(words): 
     i = 0 

     # This loop serves to go to the next word find if the first one isn't valid (contained in another word or in HTML tag) 
     while i != -1: 
      i = text.find(word, i + 1) 

      if i + 1 + len(word) < text_length: 

       # We check the before and after character of the word because some words can be contained in others 
       # Like "vision" is in "revision". As well as being contained in HTML tags 
       before = text[i - 1] 
       after = text[i + len(word)] 
       if (before in authorized_characters and 
        after in authorized_characters and not 
        (before == u'.' and after == u'.')): 
        words_indexes.append(i) 
        found_words.append(word) 

        i = -1 

    return words_indexes, found_words

大きな単語リストと大きなテキストではかなり時間がかかります（人間的には大したものではありませんが、これはDjangoビューの一部であるため作成する唯一の治療ではないため、おお

theses 1282 wordsおよびthis 231884 characters long text（Waitbutwhy articleから取られ、処理されています）では、私はコンピュータ上で約0.3秒の実行に達することができます。

しかし、より良い方法があるようにあなたは、このline_profiler

Total time: 0.28045 s 
Function: find_indexes at line 332 

Line #  Hits   Time Per Hit % Time Line Contents 
============================================================== 
    332           @line_profiler 
    333           def find_indexes(text, words): 
    334   1   4  4.0  0.0  words_indexes = [] 
    335   1   2  2.0  0.0  found_words = [] 
    336   1   2  2.0  0.0  authorized_characters = [u' ', u'.', u':', u';', u'?', u'!', u'¿', u'¡', u'…', u'(', u')'] 
    337           
    338   1   2  2.0  0.0  text_length = len(text) 
    339           
    340  1283   4362  3.4  0.7  for j, word in enumerate(words): 
    341  1282   1646  1.3  0.3   i = 0 
    342           
    343  3436  11402  3.3  1.8   while i != -1: 
    344  2154  543861 252.5  86.2    i = text.find(word, i + 1) 
    345           
    346  2154  22153  10.3  3.5    if i + 1 + len(word) < text_length: 
    347           
    348               # We check the before and after character of the word because some words can be contained in others 
    349               # Like "vision" is in "revision". As well as being contained in HTML tags 
    350  2154  16388  7.6  2.6     before = text[i - 1] 
    351  2154  19939  9.3  3.2     after = text[i + len(word)] 
    352  2154   7720  3.6  1.2     if (before in authorized_characters and 
    353  531   1468  2.8  0.2      after in authorized_characters and not 
    354  135   278  2.1  0.0      (before == u'.' and after == u'.')): 
    355  135   783  5.8  0.1      words_indexes.append(i) 
    356  135   428  3.2  0.1      found_words.append(word) 
    357           
    358  135   573  4.2  0.1      i = -1 
    359           
    360   1   2  2.0  0.0  return words_indexes, found_words

出典

2017-08-09 Qrom

ここで使用する "インデックス"の目的は何ですか？HTMLドキュメントでは大した意味を持ちません...適切なHTMLパーサーを使用して実際のテキストコンテンツを見ると、それを解析した後にソースで発生した物理的な位置を返すのはかなり面倒です。 –

実際には、テキスト内の特定の単語を強調表示し、そのためにHTMLタグを追加します。タグを配置する場所を知るためにインデックスが必要です。 – Qrom

私はあなたの問題をかなりきちんと解決していると思っています。見つかった言葉が黄色で強調表示された、サンプルテキストファイルからHTML文書を生成します。 –

で見ることができるようにfind()方法は、計算時間のほとんどを取っているので、私は感じてここでHTMLパーサーを使用しての例です（そう、それはフィルタ

import ast 
# regex (not the builtin one) and bs4 need to be pip installed 
import regex 
from bs4 import BeautifulSoup 

# Parse the document so we don't have to worry about HTML stuff 
# and can find actual text content more easily 
with open('text_to_find_the_words.txt') as fin: 
    soup = BeautifulSoup(fin, 'html.parser') 

# Get the words to look at and compile a regex to find them 
# Might already be a list in memory instead of a file. 
with open('list_of_words.txt') as fin: 
    words = ast.literal_eval(fin.read()) 
    matching_words = regex.compile(r'\b(\L<words>)\b', words=words) 

# For each matching text elements, do the highlighting 
for match in soup.find_all(text=matching_words): 
    subbed = matching_words.sub(r'<span style="background: yellow;">\1</span>', match)) 
    match.replace_with(BeautifulSoup(subbed, 'html.parser')) 

# Write the results somewhere (probably to a HttpResponse object in your case) 
with open('results.html', 'w') as fout: 
    fout.write(str(soup))

：属性/タグ内のテキストを見つけ回避するために、ドキュメントからテキスト要素）、コンパイル済みの正規表現）（それは一回ではなく、N何倍も（あなたのメインのボトルネックをループですべての単語をスキャンすることができます）アウト

よ必要に応じて単語を1回だけ強調表示するように調整する必要があります。

出典

2017-08-09 09:52:14

これは面白いです、私は見てみましょう – Qrom

大文字の文字列の最初の一致インデックスを見つける最速の方法

答えて

関連する問題