Extracting Q&A pairs from a Stack Exchange XML dump

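I want to extract question/answer pairs from https://archive.org/download/stackexchange, specifically from the Posts.xml file of a dump (I picked the Anime dump more or less at random). As I understand the file's layout, there are two types of posts: type 1 is a question (containing the question body, title, and other metadata) and type 2 is an answer (containing the score, the answer body, and other metadata).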

For example, this row is an answer, as indicated by PostTypeId="2":

<row Id="8" PostTypeId="2" ParentId="1" CreationDate="2012-12-11T20:47:52.167" Score="60" Body="&lt;p&gt;No, there is a reason why they can't. &lt;/p&gt;&#xA;&#xA;&lt;p&gt;Basically the &lt;a href=&quot;http://onepiece.wikia.com/wiki/New_World&quot;&gt;New World&lt;/a&gt; is beyond the &lt;a href=&quot;http://onepiece.wikia.com/wiki/Red_Line&quot;&gt;Red Line&lt;/a&gt;, but you cannot &quot;walk&quot; on it, or cross it. It's a huge continent, very tall that you cannot go through. You can't cross the &lt;a href=&quot;http://onepiece.wikia.com/wiki/Calm_Belt&quot;&gt;Calm Belt&lt;/a&gt; either, unless you have some form of locomotion such as the Navy or &lt;a href=&quot;http://onepiece.wikia.com/wiki/Boa_Hancock&quot;&gt;Boa Hancock&lt;/a&gt;.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;So the only way is to start from one of the Four Seas, then to go the &lt;a href=&quot;http://onepiece.wikia.com/wiki/Reverse_Mountain&quot;&gt;Reverse Mountain&lt;/a&gt; and follow the Grand Line until you reach &lt;em&gt;&lt;a href=&quot;http://onepiece.wikia.com/wiki/Raftel&quot;&gt;Raftel&lt;/a&gt;&lt;/em&gt;, which supposedly is where One Piece is located.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;&lt;img src=&quot;http://i.stack.imgur.com/69IZ0.png&quot; alt=&quot;enter image description here&quot;&gt;&lt;/p&gt;&#xA;" OwnerUserId="15" LastEditorUserId="1528" LastEditDate="2013-05-06T19:21:04.703" LastActivityDate="2013-05-06T19:21:04.703" CommentCount="1" />

and this row is the question it belongs to, as indicated by PostTypeId="1":

<row Id="1" PostTypeId="1" AcceptedAnswerId="8" CreationDate="2012-12-11T20:37:08.823" Score="69" ViewCount="22384" Body="&lt;p&gt;Assuming the world in the One Piece universe is round, then there is not really a beginning or an end of the Grand Line.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;The Straw Hats started out from the first half and are now sailing across the second half.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;Wouldn't it have been quicker to set sail in the opposite direction from where they started?  &lt;/p&gt;&#xA;" OwnerUserId="21" LastEditorUserId="1398" LastEditDate="2015-04-17T19:06:38.957" LastActivityDate="2015-05-26T12:50:40.920" Title="The treasure in One Piece is at the end of the Grand Line. But isn't that the same as the beginning?" Tags="&lt;one-piece&gt;" AnswerCount="5" CommentCount="0" FavoriteCount="2" />

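In the question row, PostTypeId="1" marks the row as a question and AcceptedAnswerId="8" gives the Id of the accepted answer. In the answer row, Id="8" matches the question's AcceptedAnswerId, PostTypeId="2" marks the row as an answer, and ParentId="1" is the Id of the question.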
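The linkage between the two rows above can be sketched in Python, used here only because its standard library ships an XML parser; the same approach carries over to Ruby. This is a minimal sketch, not a definitive implementation. It assumes every post is a `row` element carrying the attributes shown in the two rows, wrapped in a root element (called `posts` in the sample, which is an assumption). The function name `accepted_pairs` and the `sample` data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def accepted_pairs(source):
    """Return (title, question_body, answer_body) tuples.

    `source` is a path or binary file object for a Posts.xml-style file.
    Only questions that have an AcceptedAnswerId are paired.
    """
    questions = {}  # question Id -> (Title, Body, AcceptedAnswerId)
    answers = {}    # answer Id -> Body

    # iterparse streams the file, so the whole dump is never parsed at once.
    for _event, row in ET.iterparse(source, events=("end",)):
        if row.tag != "row":
            continue
        post_type = row.get("PostTypeId")
        if post_type == "1" and row.get("AcceptedAnswerId"):
            questions[row.get("Id")] = (
                row.get("Title"),
                row.get("Body"),
                row.get("AcceptedAnswerId"),
            )
        elif post_type == "2":
            answers[row.get("Id")] = row.get("Body")
        row.clear()  # drop the attributes once they have been copied

    pairs = []
    for title, body, accepted_id in questions.values():
        if accepted_id in answers:
            pairs.append((title, body, answers[accepted_id]))
    return pairs


# Hypothetical miniature of Posts.xml, reduced to the attributes used above.
sample = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" AcceptedAnswerId="8" Title="Q title" Body="&lt;p&gt;Question&lt;/p&gt;" />
  <row Id="8" PostTypeId="2" ParentId="1" Body="&lt;p&gt;Answer&lt;/p&gt;" />
  <row Id="9" PostTypeId="2" ParentId="1" Body="&lt;p&gt;Other answer&lt;/p&gt;" />
</posts>"""

for title, question, answer in accepted_pairs(io.BytesIO(sample)):
    print(title, "->", answer)
```

The parser unescapes the attribute values, so each Body comes back as the original HTML of the post. Both dictionaries are kept in memory, which is likely fine for a small site's dump but may need rethinking for the largest ones.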

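Given this data, how can I easily query it for question/answer pairs? Ideally I would convert it to a SQLite3 or MySQL database, since I am familiar with working with those kinds of data structures. If that is not possible (either through the database's own functions or via a script that populates the database), could I parse the data with Ruby, walking through the entire XML document, extracting the title and body of each question, and pairing them with the body of the proper answer?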

Thanks for your time.

Source

2017-01-22 randy newfield

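The Stack Exchange Creative Commons data dumps are just that: dumps of Stack Exchange's production Microsoft SQL Server database. So, given that the data comes from an SQL database and really is relational data, it is certainly possible to load it back into one.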

The data dump's README describes the database schema. There is an old script on Meta Stack Exchange for importing a dump into a database. Of course, if all you need is an SQL-like relational query interface, you can use the Stack Exchange Data Explorer.
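As a rough illustration of loading the data back into a relational database, here is a minimal Python sketch that imports rows into SQLite and pairs questions with their accepted answers through a self-join. It is not the import script mentioned above. The table name `posts`, the chosen subset of columns, the helper names, and the `sample` data are all assumptions made for this example; the attribute names come from the rows quoted in the question.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Attribute names as they appear in the rows quoted in the question.
INT_COLUMNS = ("Id", "PostTypeId", "ParentId", "AcceptedAnswerId", "Score")
TEXT_COLUMNS = ("Title", "Body")


def iter_records(source):
    """Stream one tuple per <row> element, in the column order of the table."""
    for _event, row in ET.iterparse(source, events=("end",)):
        if row.tag != "row":
            continue
        ints = [int(row.get(c)) if row.get(c) is not None else None
                for c in INT_COLUMNS]
        texts = [row.get(c) for c in TEXT_COLUMNS]
        yield tuple(ints + texts)
        row.clear()  # free the attributes after the record has been consumed


def load_posts(source, conn):
    """Create the (assumed) posts table and fill it from a Posts.xml-style file."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "Id INTEGER PRIMARY KEY, PostTypeId INTEGER, ParentId INTEGER, "
        "AcceptedAnswerId INTEGER, Score INTEGER, Title TEXT, Body TEXT)"
    )
    conn.executemany(
        "INSERT INTO posts VALUES (?, ?, ?, ?, ?, ?, ?)", iter_records(source)
    )
    conn.commit()


# Each question joined to the answer whose Id equals its AcceptedAnswerId.
PAIRS_SQL = """
SELECT q.Title, q.Body, a.Body
FROM posts AS q
JOIN posts AS a ON a.Id = q.AcceptedAnswerId
WHERE q.PostTypeId = 1 AND a.PostTypeId = 2
ORDER BY q.Id
"""

# Hypothetical miniature of Posts.xml, reduced to the columns used above.
sample = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" AcceptedAnswerId="8" Score="69" Title="Q title" Body="&lt;p&gt;Question&lt;/p&gt;" />
  <row Id="8" PostTypeId="2" ParentId="1" Score="60" Body="&lt;p&gt;Answer&lt;/p&gt;" />
  <row Id="9" PostTypeId="2" ParentId="1" Score="3" Body="&lt;p&gt;Other answer&lt;/p&gt;" />
</posts>"""

conn = sqlite3.connect(":memory:")
load_posts(io.BytesIO(sample), conn)
for title, question, answer in conn.execute(PAIRS_SQL):
    print(title, "->", answer)
```

To pair each question with all of its answers rather than only the accepted one, the join condition could be changed to `a.ParentId = q.Id`. For a real dump, a file path would replace `io.BytesIO(sample)` and an on-disk database file would replace `":memory:"`.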

Source

2017-01-22 23:15:01

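Thank you. I want to scrape the Q&A pairs, along with other data, from all of the data dumps for a project, so I am not looking for an online interface. That means I have to pull the data out of the dumps myself and store it in a way that I can access easily, as I described. I will try importing it into a DB. Thanks again.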
