Advice on data mining in Python

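I am a biologist with a little programming experience in Python. One of my research methods is profiling large gene lists using this database: https://david.ncifcrf.gov/ Can anyone advise whether it is possible to run a keyword search on the output and return the gene names associated with each keyword? This is for the "Table" output, which looks like this: https://david.ncifcrf.gov/annotationReport.jsp?annot=59,12,87,88,30,38,46,3,5,55,53,70,79&currentList=0 There are also backend and API options. Any insights and advice would be much appreciated.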


2017-01-19 AnnaD

If there is an open API, try to use it before looking into web scraping. Also, https://david.ncifcrf.gov/annotationReport.jsp?annot=59,12,87,88,30,38,46,3,5,55,53,70,79&currentList=0 is not working for me.

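Do the API limits meet your needs? Can you work with fewer than 400 genes, URL input limited to 2048 characters, and 200 requests per day? As for your other question about the keyword search, the second link you gave is broken (the session has expired).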
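To make the limits in this comment concrete, here is a rough sketch that splits a gene list into batches that stay under 400 genes and under a 2048-character URL. The base URL, the `ids` query parameter, and the helper name `batch_gene_ids` are placeholders for illustration only, not the real DAVID API; the actual URL format has to come from the API documentation.

```python
from urllib.parse import quote

MAX_GENES = 399          # the comment says "fewer than 400 genes"
MAX_URL_LENGTH = 2048    # URL length limit quoted in the comment
BASE_URL = "https://example.org/api?ids="   # hypothetical placeholder URL


def batch_gene_ids(gene_ids, base_url=BASE_URL,
                   max_genes=MAX_GENES, max_url_length=MAX_URL_LENGTH):
    """Split gene_ids into lists whose comma-joined URL fits both limits."""
    batches = []
    current = []
    for gene_id in gene_ids:
        candidate = current + [gene_id]
        url = base_url + quote(",".join(candidate), safe=",")
        too_big = len(candidate) > max_genes or len(url) > max_url_length
        if current and too_big:
            # Adding this gene would break a limit: close the current batch.
            batches.append(current)
            current = [gene_id]
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    genes = ["GENE%04d" % i for i in range(1000)]
    print(len(batch_gene_ids(genes)), "requests needed")
```

Each batch corresponds to one request, so if the number of batches exceeds 200, the run would have to be spread over more than one day.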

FYI, it doesn't seem to be working. – Petar

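If there is an API that provides all the data, you can automate almost everything related to it. The API will be either REST or SOAP, so first you need to find out which one you are dealing with.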

If the API is RESTful:

import base64
import json
import urllib.request

url = "https://mysuperapiurl.com/api-ws/api/port/"  # placeholder endpoint
u = 'APIUsername'
p = 'APIPassword'

def encodeUserData(user, password):
    # Build the value of an HTTP Basic Authorization header.
    token = base64.b64encode((user + ":" + password).encode("utf-8"))
    return "Basic " + token.decode("ascii")

req = urllib.request.Request(url)
req.add_header('Accept', 'application/json')
req.add_header("Content-type", "application/x-www-form-urlencoded")
req.add_header('Authorization', encodeUserData(u, p))
res = urllib.request.urlopen(req)
j = json.load(res)  # here is all the data from the API, parsed
json_str = json.dumps(j)  # the same data serialized as a string

If the API is SOAP, it gets a bit harder. What I recommend is zeep. If that is not possible because your server is on Python 2.6, or because several people are already working on it, use suds.
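For comparison, a zeep call might look roughly like the sketch below. It assumes the service needs both HTTP basic auth and a WS-Security username token, as in the suds example in this answer, and it reuses the operation name `GetServiceById` from that example; the function name `call_with_zeep` is made up here, and the code needs a reachable WSDL to actually run.

```python
def call_with_zeep(wsdl_url, username, password, arg1, arg2):
    """Call GetServiceById through zeep with HTTP basic auth and WS-Security."""
    # Imported inside the function so the module loads without zeep installed.
    import requests
    from zeep import Client
    from zeep.transports import Transport
    from zeep.wsse.username import UsernameToken

    # Reuse one HTTP session so the basic-auth credentials are sent each time.
    session = requests.Session()
    session.auth = (username, password)

    client = Client(
        wsdl_url,
        transport=Transport(session=session),
        wsse=UsernameToken(username, password),
    )
    # The operation name comes from the WSDL; this one is only an example.
    return client.service.GetServiceById(arg1, arg2)
```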

With suds, an API call looks like this:

import logging
import requests
import suds_requests
from suds import TypeNotFound
from suds.client import Client
from suds.wsse import Security, UsernameToken
from bs4 import BeautifulSoup as Soup

try:
    from cStringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3

# Capture the suds debug log, which includes the SOAP XML, in memory.
log_stream = StringIO()
logging.basicConfig(stream=log_stream, level=logging.INFO)
logging.getLogger('suds.transport').setLevel(logging.DEBUG)
logging.getLogger('suds.client').setLevel(logging.DEBUG)

WSDL_URL = 'http://213.166.38.97:8080/SRIManagementWS/services/SRIManagementSOAP?wsdl'

username = 'username'
password = 'password'
session = requests.session()
session.auth = (username, password)

# Create the SOAP client, sending requests through the authenticated session.
client = Client(WSDL_URL, transport=suds_requests.RequestsTransport(session))

def addSecurityHeader(client, username, password):
    # Attach a WS-Security UsernameToken to every request.
    security = Security()
    userNameToken = UsernameToken(username, password)
    security.tokens.append(userNameToken)
    client.set_options(wsse=security)

addSecurityHeader(client, username, password)

arg1 = "argument_1"
arg2 = "argument_2"

try:
    client.service.GetServiceById(arg1, arg2)
except TypeNotFound as e:
    print(e)
logresults = log_stream.getvalue()

What you receive in return is XML, so I use BeautifulSoup to pretty-print the result:

soup = Soup(logresults, "html.parser")  # name the parser explicitly
print(soup.prettify())
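Once the XML is in hand, pulling values out of it is usually the next step. The sketch below uses the standard library's `xml.etree.ElementTree` on a made-up response; the element names (`service`, `id`, `name`) and the helper `extract_services` are invented for illustration and will differ for a real API.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML payload; a real response will have different tags.
SAMPLE_XML = """
<response>
  <service><id>1</id><name>alpha</name></service>
  <service><id>2</id><name>beta</name></service>
</response>
"""


def extract_services(xml_text):
    """Return a list of (id, name) tuples, one per <service> element."""
    root = ET.fromstring(xml_text.strip())
    return [
        (node.findtext("id"), node.findtext("name"))
        for node in root.iter("service")
    ]


if __name__ == "__main__":
    for service_id, name in extract_services(SAMPLE_XML):
        print(service_id, name)
```

Real SOAP responses are namespaced, so the tag lookups would normally need the namespace as well.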

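OK, so the API connection part is covered. Now, where do you store the data, and where do you iterate over it to run the keyword search? In your database. I recommend MySQLdb. Set up your table and think about which information (collected from the API) you are going to store in which column. You also need a place to set your keywords (a list, a dictionary, a text file, or hard-coded values; you can also do this through another SQL query), to compare them with the results you extract from the database, and to define what to do with the outcome.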

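import sys
import MySQLdb

def dbconnect():
    # Open a connection to the local MySQL database.
    try:
        db = MySQLdb.connect(
            host='localhost',
            user='root',
            passwd='password',
            db='mysuperdb'
        )
    except Exception as e:
        sys.exit("Can't connect to database")
    return db

def getSQL():
    # Fetch every row of the table that holds the API data.
    db = dbconnect()
    cursor = db.cursor()
    sql = "select * from yoursupertable"
    cursor.execute(sql)
    results = cursor.fetchall()
    db.close()
    return results

def dataResult():
    # Collect the second column of each row; compare with your keywords here.
    results = getSQL()
    values = []
    for row in results:
        values.append(row[1])
    return values

print(dataResult())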

So this is where you define what to do when they match, etc. :)
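As a rough illustration of the store-then-search idea, the sketch below uses an in-memory SQLite database from the standard library instead of MySQL, so that it runs without a server. The table name `annotations`, its columns, the helper names, and the sample rows are all invented for the example; they are not taken from DAVID output.

```python
import sqlite3

# Invented sample rows: (gene name, annotation text).
SAMPLE_ROWS = [
    ("GENE_A", "protein kinase activity; ATP binding"),
    ("GENE_B", "DNA repair; nucleus"),
    ("GENE_C", "serine/threonine kinase; cell cycle"),
]


def build_database(rows):
    """Create an in-memory table and load the (gene, annotation) rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE annotations (gene TEXT, annotation TEXT)")
    db.executemany("INSERT INTO annotations VALUES (?, ?)", rows)
    db.commit()
    return db


def genes_for_keywords(db, keywords):
    """Map each keyword to the sorted gene names whose annotation contains it."""
    matches = {}
    for keyword in keywords:
        cursor = db.execute(
            "SELECT DISTINCT gene FROM annotations "
            "WHERE annotation LIKE ? ORDER BY gene",
            ("%" + keyword + "%",),
        )
        matches[keyword] = [row[0] for row in cursor.fetchall()]
    return matches


if __name__ == "__main__":
    database = build_database(SAMPLE_ROWS)
    print(genes_for_keywords(database, ["kinase", "repair"]))
```

With MySQLdb the same query would use `%s` placeholders rather than `?`. Keywords containing `%` or `_` act as wildcards in `LIKE`, so they would need escaping if they should match literally.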


2017-01-27 11:29:55

Thank you. I will let you know how it goes. – AnnaD

If this answer was helpful, please consider upvoting it or accepting it as the answer.
