Partial match in Oracle Database

I have a very large table (more than 1 million rows). The rows hold product names and prices from different sources.

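There are many products with the same name that differ in price.

The problem is that the same product appears many times across rows, but its name is not the same in each row. For example:

Row    Product name              Price
-----  ------------------------  -----
Row 1  XYZ - size information    $a
Row 2  XYZ -Brand information    $b
Row 3  xyz                       $c

I want to get all the products that differ in price. If the names were the same across rows, I could simply do a self-join on Table1.Product_Name = Table2.Product_Name and Table1.Price != Table2.Price.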

But this will not work in this case :(

Can anyone suggest a solution for this?

Source

2011-01-27 onsy

You could try using regexp_replace to head in the right direction:

create table tq84_products (
    name varchar2(50), 
    price varchar2(5) 
);

The table holds three products: XYZ, ABCD and efghi. ABCD has two records with the same price; all the other records have different prices.

insert into tq84_products values (' XYZ - size information', '$a');
insert into tq84_products values ('XYZ - brand information', '$b');
insert into tq84_products values ('xyz', '$c');

insert into tq84_products values ('Product ABCD', '$d');
insert into tq84_products values ('Abcd is the best', '$d');

insert into tq84_products values ('efghi is cheap', '$f');
insert into tq84_products values ('no, efghi is expensive', '$g');

The following select statement uses a list of stop words to drop words that commonly occur in product names:

with split_into_words as (
    -- Cross join every product with the counters 1..9; the row with
    -- counter N keeps only the Nth word of the name (empty, i.e. null
    -- in Oracle, when the name has fewer than N words).
    select
        name,
        price,
        upper(
            regexp_replace(
                name,
                '\W*' ||
                '(\w+)?\W?+' ||
                '(\w+)?\W?+' ||
                '(\w+)?\W?+' ||
                '(\w+)?\W?+' ||
                '(\w+)?\W?+' ||
                '(\w+)?\W?+' ||
                '(\w+)?\W?+' ||
                '(\w+)?\W?+' ||
                '(\w+)?' ||
                '.*',
                '\' || submatch.counter  -- back-reference \1 .. \9
            )
        ) word
    from
        tq84_products,
        (select rownum counter
         from dual
         connect by level < 10
        ) submatch
),
stop_words as (
    -- Words too common in product names to count as a match.
    select 'IS'          word from dual union all
    select 'BRAND'       word from dual union all
    select 'INFORMATION' word from dual
)
select
    w1.price,
    w2.price,
    w1.name,
    w2.name
--  substr(w1.word, 1, 30)               common_word,
--  count(*) over (partition by w1.name) cnt
from
    split_into_words w1,
    split_into_words w2
where
    w1.word = w2.word and
    w1.name < w2.name and   -- report each pair only once
    w1.word is not null and
    w2.word is not null and
    w1.word not in (select word from stop_words) and
    w2.word not in (select word from stop_words) and
    w1.price != w2.price;

This returns:

$a  $b   XYZ - size information   XYZ - brand information
$b  $c  XYZ - brand information   xyz
$a  $c   XYZ - size information   xyz
$f  $g  efghi is cheap            no, efghi is expensive

So ABCD is not returned, while the others are.
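To sanity-check the matching rule outside the database, here is a minimal Python sketch of the same idea: split each name into upper-cased words, drop the stop words, and report pairs that share a word but differ in price. It is only an illustration, not a translation of the Oracle query: the names `words` and `price_conflicts` are made up for this sketch, and unlike the SQL above it is not limited to the first nine words of a name.

```python
import re
from itertools import combinations

# Same stop words as in the SQL above.
STOP_WORDS = {"IS", "BRAND", "INFORMATION"}

def words(name):
    """Return the upper-cased words of a product name, minus stop words."""
    return set(re.findall(r"\w+", name.upper())) - STOP_WORDS

def price_conflicts(products):
    """Return (price1, price2, name1, name2) for every pair of products
    that shares at least one non-stop word but has different prices."""
    result = []
    # Sorting first means each pair is reported once, smaller name first.
    for (n1, p1), (n2, p2) in combinations(sorted(products), 2):
        if p1 != p2 and words(n1) & words(n2):
            result.append((p1, p2, n1, n2))
    return result

# Sample rows copied from the insert statements above.
products = [
    (" XYZ - size information", "$a"),
    ("XYZ - brand information", "$b"),
    ("xyz", "$c"),
    ("Product ABCD", "$d"),
    ("Abcd is the best", "$d"),
    ("efghi is cheap", "$f"),
    ("no, efghi is expensive", "$g"),
]

for row in price_conflicts(products):
    print(row)
```

On the sample data this reports the three XYZ pairs and the efghi pair, and skips ABCD because both of its records carry the same price.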

Source

2011-01-27 07:32:35

I will try this. – onsy

This is useful; my only concern is the stop words in my case. – onsy
