2009-07-28 3 views

Answers

20

To do this you need to write your own analyzer class, which is relatively straightforward. Here is the one I use. It combines stop word filtering, Porter stemming, and (this may be more than you need) stripping accents from characters.

using System.Collections;   // Hashtable
using Lucene.Net.Analysis;  // Analyzer, TokenStream, filters and tokenizers

/// <summary>
/// An analyzer that chains several filters: Porter stemming,
/// diacritic stripping, and stop word filtering.
/// </summary>
public class CustomAnalyzer : Analyzer 
{ 
    /// <summary> 
    /// A rather short list of stop words that is fine for basic search use. 
    /// </summary> 
    private static readonly string[] stopWords = new[] 
    { 
     "0", "1", "2", "3", "4", "5", "6", "7", "8", 
     "9", "000", "$", "£", 
     "about", "after", "all", "also", "an", "and", 
     "another", "any", "are", "as", "at", "be", 
     "because", "been", "before", "being", "between", 
     "both", "but", "by", "came", "can", "come", 
     "could", "did", "do", "does", "each", "else", 
     "for", "from", "get", "got", "has", "had", 
     "he", "have", "her", "here", "him", "himself", 
     "his", "how","if", "in", "into", "is", "it", 
     "its", "just", "like", "make", "many", "me", 
     "might", "more", "most", "much", "must", "my", 
     "never", "now", "of", "on", "only", "or", 
     "other", "our", "out", "over", "re", "said", 
     "same", "see", "should", "since", "so", "some", 
     "still", "such", "take", "than", "that", "the", 
     "their", "them", "then", "there", "these", 
     "they", "this", "those", "through", "to", "too", 
     "under", "up", "use", "very", "want", "was", 
     "way", "we", "well", "were", "what", "when", 
     "where", "which", "while", "who", "will", 
     "with", "would", "you", "your", 
     "a", "b", "c", "d", "e", "f", "g", "h", "i", 
     "j", "k", "l", "m", "n", "o", "p", "q", "r", 
     "s", "t", "u", "v", "w", "x", "y", "z" 
    }; 

    private Hashtable stopTable; 

    /// <summary> 
    /// Creates an analyzer with the default stop word list. 
    /// </summary> 
    public CustomAnalyzer() : this(stopWords) {} 

    /// <summary> 
    /// Creates an analyzer with the passed in stop words list. 
    /// </summary> 
    public CustomAnalyzer(string[] stopWords) 
    { 
     stopTable = StopFilter.MakeStopSet(stopWords);  
    } 

    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader) 
    { 
     // Use the instance's stop set so that stop words passed to the constructor are honored.
     return new PorterStemFilter(new ISOLatin1AccentFilter(new StopFilter(new LowerCaseTokenizer(reader), stopTable)));
    } 
} 
+1

Thanks, I'll give this a try. – devson

+1

+1 Jack, thanks, this is exactly what I was looking for. If only I could mark this as the answer! – andy

+0

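I used your example, but I get no results when querying for a number such as '4656' (the standard analyzer works). I replaced the stop words with `StopAnalyzer.ENGLISH_STOP_WORDS`, which does not contain numbers. Any ideas? – Myster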

7

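You can use either Snowball or PorterStemFilter. See the Java Analyzer documentation as a guide to combining different filters, tokenizers, and analyzers. Remember that you have to use the same analyzer for indexing and for searching, so the stemming has to be in place at index time.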
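To illustrate why the same analysis has to run at both index time and query time, here is a minimal, self-contained sketch. It does not use Lucene at all: `AnalysisSketch`, `Analyze`, and `Stem` are hypothetical names, the stop list is a tiny made-up sample, and the suffix stripping is a toy stand-in for a real Porter stemmer.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for an analyzer chain; this is NOT Lucene code.
public static class AnalysisSketch
{
    // Tiny sample stop list, for illustration only.
    private static readonly HashSet<string> StopWords =
        new HashSet<string> { "the", "a", "of", "and" };

    // Lowercase, split at non-letters, drop stop words, then crudely "stem".
    public static List<string> Analyze(string text)
    {
        var terms = new List<string>();
        var current = new StringBuilder();
        foreach (char c in text + " ")  // trailing space flushes the last token
        {
            if (char.IsLetter(c))
            {
                current.Append(char.ToLowerInvariant(c));
                continue;
            }
            if (current.Length == 0) continue;
            string token = current.ToString();
            current.Length = 0;
            if (!StopWords.Contains(token)) terms.Add(Stem(token));
        }
        return terms;
    }

    // Toy suffix stripping; a real Porter stemmer has many more rules.
    private static string Stem(string token)
    {
        if (token.Length > 5 && token.EndsWith("ing", StringComparison.Ordinal))
            return token.Substring(0, token.Length - 3);
        if (token.Length > 3 && token.EndsWith("s", StringComparison.Ordinal))
            return token.Substring(0, token.Length - 1);
        return token;
    }

    public static void Main()
    {
        // "Index" a document by storing its analyzed terms.
        var indexed = new HashSet<string>(Analyze("The walking dogs"));

        // Query analyzed the same way: its terms line up with the index.
        Console.WriteLine(indexed.IsSupersetOf(Analyze("Walking")));

        // Raw, unanalyzed query term: nothing in the index matches it.
        Console.WriteLine(indexed.Contains("Walking"));
    }
}
```

The index only ever contains analyzed terms, so a query that skips the same steps looks for terms that were never stored. Note also that this sketch splits at non-letters, so `Analyze("4656")` returns no terms at all; as far as I know `LowerCaseTokenizer` divides text at non-letters in the same way, which would be consistent with the numeric query in the comment above returning nothing.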

+0

Thanks, I'll give this a try. – devson
