How can I improve indexing for a large SPARQL dataset?

I have a very simple SPARQL query that takes a very long time (10 seconds) to run in MarkLogic (8.0-6.4). What can I do to speed it up? How can I improve indexing for a large SPARQL dataset?

The data is based on a subset of Geonames and is of the same order of magnitude in size (it looks to be about 22 million triples).

PREFIX gj: <http://mycompany.com/geonames-jurisdiction/1.0/schema#> 
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX gn: <http://www.geonames.org/ontology#> 

SELECT * 
FROM <http://mycompany.com/geonames-jurisdiction/1.0/data> 
FROM <http://mycompany.com/geonames-jurisdiction/1.0/rule-data> 
WHERE 
    { ?this_0 rdf:type gj:LocalCounty ; 
      gn:name ?name_1 . 
    } 
ORDER BY ASC(?name_1) 
LIMIT 100

Update

Per MarkLogic's suggestion, I inserted a new property into the database that is specific to local counties, using this update:

INSERT { 
    GRAPH <http://mycompany.com/geonames-jurisdiction/1.0/rule-data> { 
    ?this gj:localCountyName ?name . 
    } 
} 
WHERE { 
    ?this a gj:LocalCounty . 
    ?this gn:name ?name . 
}

I also applied some suggested modifications to the query:

PREFIX gj: <http://mycompany.com/geonames-jurisdiction/1.0/schema#> 
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX gn: <http://www.geonames.org/ontology#> 

SELECT ?this_0 ?name_1 
FROM <http://mycompany.com/geonames-jurisdiction/1.0/data> 
FROM <http://mycompany.com/geonames-jurisdiction/1.0/rule-data> 
WHERE 
    { ?this_0 rdf:type gj:LocalCounty ; 
      gj:localCountyName ?name_1 . 
    } 
ORDER BY ?name_1 
LIMIT 20

This brings the total query time down to ~4 seconds, which is better but still huge.

Trace information from the query above:

2017-05-04 12:00:18.684 Info: <triple-value-statistics count="147540458" unique-subjects="25064012" unique-predicates="81" unique-objects="67600843" xmlns="cts:triple-value-statistics"> 
2017-05-04 12:00:18.684 Info: <triple-value-entries> 
2017-05-04 12:00:18.684 Info:  <triple-value-entry count="8385355"> 
2017-05-04 12:00:18.684 Info:  <triple-value>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</triple-value> 
2017-05-04 12:00:18.684 Info:  <subject-statistics count="0" unique-predicates="0" unique-objects="0"/> 
2017-05-04 12:00:18.684 Info:  <predicate-statistics count="8356279" unique-subjects="8341989" unique-objects="13"/> 
2017-05-04 12:00:18.684 Info:  <object-statistics count="0" unique-subjects="0" unique-predicates="0"/> 
2017-05-04 12:00:18.684 Info:  </triple-value-entry> 
2017-05-04 12:00:18.684 Info:  <triple-value-entry count="29204"> 
2017-05-04 12:00:18.684 Info:  <triple-value>http://mycompany.com/geonames-jurisdiction/1.0/schema#LocalCounty</triple-value> 
2017-05-04 12:00:18.684 Info:  <subject-statistics count="2" unique-predicates="2" unique-objects="2"/> 
2017-05-04 12:00:18.684 Info:  <predicate-statistics count="0" unique-subjects="0" unique-objects="0"/> 
2017-05-04 12:00:18.684 Info:  <object-statistics count="29202" unique-subjects="29202" unique-predicates="3"/> 
2017-05-04 12:00:18.684 Info:  </triple-value-entry> 
2017-05-04 12:00:18.684 Info:  <triple-value-entry count="29201"> 
2017-05-04 12:00:18.684 Info:  <triple-value>http://mycompany.com/geonames-jurisdiction/1.0/schema#localCountyName</triple-value> 
2017-05-04 12:00:18.684 Info:  <subject-statistics count="0" unique-predicates="0" unique-objects="0"/> 
2017-05-04 12:00:18.684 Info:  <predicate-statistics count="29201" unique-subjects="29201" unique-objects="26692"/> 
2017-05-04 12:00:18.684 Info:  <object-statistics count="0" unique-subjects="0" unique-predicates="0"/> 
2017-05-04 12:00:18.684 Info:  </triple-value-entry> 
2017-05-04 12:00:18.684 Info: </triple-value-entries> 
2017-05-04 12:00:18.684 Info: </triple-value-statistics> 
2017-05-04 12:00:18.684 Info: [Event:id=SPARQL AST] sessionKey=7777437449602930525 
2017-05-04 12:00:18.684 Info: initialPlan=SPARQLModule[ 
2017-05-04 12:00:18.684 Info: Prolog[] 
2017-05-04 12:00:18.684 Info: SPARQLSelect[SPARQLLimit[ 
2017-05-04 12:00:18.684 Info:  LIMIT GraphNode[Literal "20"^^<http://www.w3.org/2001/XMLSchema#integer>] 
2017-05-04 12:00:18.684 Info:  SPARQLProject[order(1) 
2017-05-04 12:00:18.684 Info:   GraphNode[Var this_0 0] 
2017-05-04 12:00:18.684 Info:   GraphNode[Var name_1 1] 
2017-05-04 12:00:18.684 Info:   SPARQLOrder[order(1) UNSORTED 
2017-05-04 12:00:18.684 Info:   OrderSpec[ 
2017-05-04 12:00:18.684 Info:    Variable[QName[(Unknown) name_1] 1] 
2017-05-04 12:00:18.684 Info:    ASCENDING EMPTY MIN] 
2017-05-04 12:00:18.684 Info:   SPARQLMergeJoin[order(0) hash(0==0) scatter() 
2017-05-04 12:00:18.684 Info:    TriplePattern[order(0,1) PSO 
2017-05-04 12:00:18.684 Info:    GraphNode[Var this_0 0] 
2017-05-04 12:00:18.684 Info:    GraphNode[IRI <http://mycompany.com/geonames-jurisdiction/1.0/schema#localCountyName>] 
2017-05-04 12:00:18.684 Info:    GraphNode[Var name_1 1]] 
2017-05-04 12:00:18.684 Info:    TriplePattern[order(0) OPS 
2017-05-04 12:00:18.684 Info:    GraphNode[Var this_0 0] 
2017-05-04 12:00:18.684 Info:    GraphNode[IRI <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>] 
2017-05-04 12:00:18.684 Info:    GraphNode[IRI <http://mycompany.com/geonames-jurisdiction/1.0/schema#LocalCounty>]]]]]]]] 
2017-05-04 12:00:18.684 Info: [Event:id=SPARQL Cost Analysis] sessionKey=7777437449602930525 optimize=1 r=3 t=1.28811 os=360 is=15 mutations=9 seed=15212683942933123635 
2017-05-04 12:00:18.684 Info: initialCost=(m:6.02656e+07,r:0,io:(52.931/1.20414e+07/0),cpu(2):(0/116805/0),mem:87603,c:20,crd:[20,20]) 
2017-05-04 12:00:18.726 Info: [Event:id=SPARQL Cost Analysis] sessionKey=7777437449602930525 diff=0 diff%=0 r=0 
2017-05-04 12:00:18.726 Info: cost=(m:6.02656e+07,r:0,io:(52.931/1.20414e+07/0),cpu(2):(0/116805/0),mem:87603,c:20,crd:[20,20]) 
2017-05-04 12:00:18.726 Info: [Event:id=SPARQL Cost Analysis] sessionKey=7777437449602930525 diff=0 diff%=0 r=1 
2017-05-04 12:00:18.726 Info: cost=(m:6.02656e+07,r:0,io:(52.931/1.20414e+07/0),cpu(2):(0/116805/0),mem:87603,c:20,crd:[20,20]) 
2017-05-04 12:00:18.728 Info: [Event:id=SPARQL Cost Analysis] sessionKey=7777437449602930525 diff=0 diff%=0 r=2 
2017-05-04 12:00:18.728 Info: cost=(m:6.02656e+07,r:0,io:(52.931/1.20414e+07/0),cpu(2):(0/116805/0),mem:87603,c:20,crd:[20,20]) 
2017-05-04 12:00:18.728 Info: [Event:id=SPARQL Cost Analysis] sessionKey=7777437449602930525 
2017-05-04 12:00:18.728 Info: bestCost=(m:6.02656e+07,r:0,io:(52.931/1.20414e+07/0),cpu(2):(0/116805/0),mem:87603,c:20,crd:[20,20]) 
2017-05-04 12:00:18.729 Info: [Event:id=SPARQL AST] sessionKey=7777437449602930525 
2017-05-04 12:00:18.729 Info: plan=SPARQLModule[ 
2017-05-04 12:00:18.729 Info: Prolog[] 
2017-05-04 12:00:18.729 Info: SPARQLSelect[SPARQLLimit[ 
2017-05-04 12:00:18.729 Info:  LIMIT GraphNode[Literal "20"^^<http://www.w3.org/2001/XMLSchema#integer>] 
2017-05-04 12:00:18.729 Info:  SPARQLProject[order(1) 
2017-05-04 12:00:18.729 Info:   GraphNode[Var this_0 0] 
2017-05-04 12:00:18.729 Info:   GraphNode[Var name_1 1] 
2017-05-04 12:00:18.729 Info:   SPARQLOrder[order(1) UNSORTED 
2017-05-04 12:00:18.729 Info:   OrderSpec[ 
2017-05-04 12:00:18.729 Info:    Variable[QName[(Unknown) name_1] 1] 
2017-05-04 12:00:18.729 Info:    ASCENDING EMPTY MIN] 
2017-05-04 12:00:18.729 Info:   SPARQLMergeJoin[order(0) hash(0==0) scatter() 
2017-05-04 12:00:18.729 Info:    TriplePattern[order(0,1) PSO 
2017-05-04 12:00:18.729 Info:    GraphNode[Var this_0 0] 
2017-05-04 12:00:18.729 Info:    GraphNode[IRI <http://mycompany.com/geonames-jurisdiction/1.0/schema#localCountyName>] 
2017-05-04 12:00:18.729 Info:    GraphNode[Var name_1 1]] 
2017-05-04 12:00:18.729 Info:    TriplePattern[order(0) OPS 
2017-05-04 12:00:18.729 Info:    GraphNode[Var this_0 0] 
2017-05-04 12:00:18.729 Info:    GraphNode[IRI <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>] 
2017-05-04 12:00:18.729 Info:    GraphNode[IRI <http://mycompany.com/geonames-jurisdiction/1.0/schema#LocalCounty>]]]]]]]]

Source

2017-05-02 RMorrisey

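What is the total number of results that match the query? And how does it perform without the 'ORDER BY'? I ask because, with the ordering, the engine basically has to process all of the data that matches the graph pattern. – AKSW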

If I remove the LIMIT clause and do a count, I get ~29,000 triples. – RMorrisey
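The exact count query is not shown in the thread. A minimal sketch of what such a check might look like, assuming the same graphs and the gj:localCountyName property from the update above (the ?matches variable name is made up for illustration):

```sparql
PREFIX gj: <http://mycompany.com/geonames-jurisdiction/1.0/schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

# Count every solution of the graph pattern; no ORDER BY and no LIMIT.
# ?matches is a hypothetical name chosen for this sketch.
SELECT (COUNT(*) AS ?matches)
FROM <http://mycompany.com/geonames-jurisdiction/1.0/data>
FROM <http://mycompany.com/geonames-jurisdiction/1.0/rule-data>
WHERE
    { ?this_0 rdf:type gj:LocalCounty ;
      gj:localCountyName ?name_1 .
    }
```

A result near 29,000 would be in line with the predicate-statistics entry for gj:localCountyName in the trace (count="29201").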

OK, and what about removing the 'ORDER BY'? That should be much faster. – AKSW
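One way to try this suggestion is to run the revised query with only the ORDER BY line removed and compare the timings. This is a diagnostic sketch, not a replacement query: without ORDER BY the 20 rows returned are arbitrary, not the first 20 names alphabetically.

```sparql
PREFIX gj: <http://mycompany.com/geonames-jurisdiction/1.0/schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

# Same pattern as the revised query, but with no ORDER BY, so the
# engine may stop after producing the first 20 solutions.
SELECT ?this_0 ?name_1
FROM <http://mycompany.com/geonames-jurisdiction/1.0/data>
FROM <http://mycompany.com/geonames-jurisdiction/1.0/rule-data>
WHERE
    { ?this_0 rdf:type gj:LocalCounty ;
      gj:localCountyName ?name_1 .
    }
LIMIT 20
```

If this version returns quickly, the remaining time is probably going to the sort step (the SPARQLOrder node marked UNSORTED in the plan above), which has to see every matching row before the LIMIT can be applied.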

Depending on your hardware (memory, CPU, disk), you can improve performance by increasing the number of forests.

Source

2017-05-03 14:39:00

MarkLogic uses a scale-out architecture, so nothing guarantees scalable performance on a single machine. The best way to scale is to add more nodes, specifically e-nodes, each with adequate memory.

Source

2017-09-06 14:33:19 scotthenninger
