2016-09-19

Using GraphFrames in PyCharm

I have spent about two days scrolling through the internet, but I have not been able to solve this problem. I am trying to install the graphframes package (version: 0.2.0-spark2.0-s_2.11) so that I can run Spark from PyCharm, but despite my best efforts it has been impossible.

I have tried almost everything. I checked this site here and the existing answers before posting.

Here is the code I am trying to run:

# IMPORT OTHER LIBS -------------------------------------------------------- 
import os 
import sys 
import pandas as pd 

# IMPORT SPARK ------------------------------------------------------------------------------------# 
# Path to Spark source folder 
USER_FILE_PATH = "/Users/<username>" 
SPARK_PATH = "/PycharmProjects/GenesAssociation" 
SPARK_FILE = "/spark-2.0.0-bin-hadoop2.7" 
SPARK_HOME = USER_FILE_PATH + SPARK_PATH + SPARK_FILE 
os.environ['SPARK_HOME'] = SPARK_HOME 

# Append pySpark to Python Path 
sys.path.append(SPARK_HOME + "/python") 
sys.path.append(SPARK_HOME + "/python" + "/lib/py4j-0.10.1-src.zip") 

try: 
    from pyspark import SparkContext 
    from pyspark import SparkConf 
    from pyspark.sql import SQLContext 
    from pyspark import graphframes as gf

except ImportError as ex: 
    print "Can not import Spark Modules", ex 
    sys.exit(1) 

# GLOBAL VARIABLES --------------------------------------------------------------------------------#
SC = SparkContext('local') 
SQL_CONTEXT = SQLContext(SC) 

# MAIN CODE ---------------------------------------------------------------------------------------# 
if __name__ == "__main__": 

    # Main Path to CSV files 
    DATA_PATH = '/PycharmProjects/GenesAssociation/data/' 
    FILE_NAME = 'gene_gene_associations_50k.csv' 

    # LOAD DATA CSV USING PANDAS -----------------------------------------------------------------# 
    print "STEP 1: Loading Gene Nodes -------------------------------------------------------------" 
    # Read csv file and load as df 
    GENES = pd.read_csv(USER_FILE_PATH + DATA_PATH + FILE_NAME, 
         usecols=['OFFICIAL_SYMBOL_A'], 
         low_memory=True, 
         iterator=True, 
         chunksize=1000) 

    # Concatenate chunks into list & convert to dataFrame 
    GENES_DF = pd.DataFrame(pd.concat(list(GENES), ignore_index=True)) 

    # Remove duplicates 
    GENES_DF_CLEAN = GENES_DF.drop_duplicates(keep='first') 

    # Name Columns 
    GENES_DF_CLEAN.columns = ['gene_id'] 

    # Output dataFrame 
    print GENES_DF_CLEAN 

    # Create vertices 
    VERTICES = SQL_CONTEXT.createDataFrame(GENES_DF_CLEAN) 

    # Show some vertices 
    print VERTICES.take(5) 

    print "STEP 2: Loading Gene Edges -------------------------------------------------------------" 
    # Read csv file and load as df 
    EDGES = pd.read_csv(USER_FILE_PATH + DATA_PATH + FILE_NAME, 
         usecols=['OFFICIAL_SYMBOL_A', 'OFFICIAL_SYMBOL_B', 'EXPERIMENTAL_SYSTEM'], 
         low_memory=True, 
         iterator=True, 
         chunksize=1000) 

    # Concatenate chunks into list & convert to dataFrame 
    EDGES_DF = pd.DataFrame(pd.concat(list(EDGES), ignore_index=True)) 

    # Name Columns 
    EDGES_DF.columns = ["src", "dst", "rel_type"] 

    # Output dataFrame 
    print EDGES_DF 

    # Create edges
    EDGES = SQL_CONTEXT.createDataFrame(EDGES_DF) 

    # Show some edges 
    print EDGES.take(5) 

    g = gf.GraphFrame(VERTICES, EDGES) 

Needless to say, I have tried including the graphframes directory in Spark's pyspark directory (see here to understand what I did), but that does not seem to be enough... everything else I have tried has failed as well. Any help would be appreciated. I get the following error message:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 
Setting default log level to "WARN". 
To adjust logging level use sc.setLogLevel(newLevel). 
16/09/19 12:46:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 
16/09/19 12:46:03 WARN Utils: Service 'SparkUI' could not bind on port 4040.  Attempting port 4041. 

STEP 1: Loading Gene Nodes ------------------------------------------------------------- 
     gene_id 
0   MAP2K4 
1   MYPN 
2   ACVR1 
3   GATA2 
4   RPA2 
5   ARF1 
6   ARF3 
8   XRN1 
9   APP 
10   APLP1 
11  CITED2 
12   EP300 
13   APOB 
14   ARRB2 
15   CSF1R 
16  PRRC2A 
17   LSM1 
18  SLC4A1 
19   BCL3 
20   ADRB1 
21   BRCA1 
25   ARVCF 
26   PCBD1 
27   PSEN2 
28   CAPN3 
29   ITPR1 
30   MAGI1 
31   RB1 
32  TSG101 
33   ORC1 
...   ... 
49379  WDR26 
49380  WDR5B 
49382  NLE1 
49383  WDR12 
49385  WDR53 
49386  WDR59 
49387  WDR61 
49409  CHD6 
49422  DACT1 
49424  KMT2B 
49438 SMARCA1 
49459 DCLRE1A 
49469  F2RL1 
49472  SENP8 
49475  TSPY1 
49479 SERPINB5 
49521  HOXA11 
49548  SYF2 
49553  FOXN3 
49557  MLANA 
49608  REPIN1 
49609  GMNN 
49670 HIST2H2BE 
49767  BCL7C 
49797  SIRT3 
49810  KLF4 
49858  RHO 
49896  MAGEA2 
49907 SUV420H2 
49958  SAP30L 

[6025 rows x 1 columns] 
16/09/19 12:46:08 WARN TaskSetManager: Stage 0 contains a task of very large size (107 KB). The maximum recommended task size is 100 KB. 
[Row(gene_id=u'MAP2K4'), Row(gene_id=u'MYPN'), Row(gene_id=u'ACVR1'), Row(gene_id=u'GATA2'), Row(gene_id=u'RPA2')] 
STEP 2: Loading Gene Edges ------------------------------------------------------------- 
      src  dst     rel_type 
0  MAP2K4  FLNC    Two-hybrid 
1   MYPN  ACTN2    Two-hybrid 
2  ACVR1  FNTA    Two-hybrid 
3  GATA2  PML    Two-hybrid 
4   RPA2  STAT3    Two-hybrid 
5   ARF1  GGA3    Two-hybrid 
6   ARF3 ARFIP2    Two-hybrid 
7   ARF3 ARFIP1    Two-hybrid 
8   XRN1  ALDOA    Two-hybrid 
9   APP APPBP2    Two-hybrid 
10  APLP1  DAB1    Two-hybrid 
11  CITED2 TFAP2A    Two-hybrid 
12  EP300 TFAP2A    Two-hybrid 
13  APOB  MTTP    Two-hybrid 
14  ARRB2 RALGDS    Two-hybrid 
15  CSF1R  GRB2    Two-hybrid 
16  PRRC2A  GRB2    Two-hybrid 
17  LSM1  NARS    Two-hybrid 
18  SLC4A1 SLC4A1AP    Two-hybrid 
19  BCL3  BARD1    Two-hybrid 
20  ADRB1  GIPC1    Two-hybrid 
21  BRCA1  ATF1    Two-hybrid 
22  BRCA1  MSH2    Two-hybrid 
23  BRCA1  BARD1    Two-hybrid 
24  BRCA1  MSH6    Two-hybrid 
25  ARVCF  CDH15    Two-hybrid 
26  PCBD1 CACNA1C    Two-hybrid 
27  PSEN2  CAPN1    Two-hybrid 
28  CAPN3  TTN    Two-hybrid 
29  ITPR1  CA8    Two-hybrid 
...  ...  ...      ... 
49969 SAP30  HDAC3 Affinity Capture-Western 
49970 BRCA1  RBBP8   Co-localization 
49971 BRCA1  BRCA1  Biochemical Activity 
49972  SET  TREX1   Co-purification 
49973  SET  TREX1  Reconstituted Complex 
49974 PLAGL1  EP300  Reconstituted Complex 
49975 PLAGL1 CREBBP  Reconstituted Complex 
49976 EP300 PLAGL1 Affinity Capture-Western 
49977  MTA1  ESR1  Reconstituted Complex 
49978 SIRT2  EP300 Affinity Capture-Western 
49979 EP300  SIRT2 Affinity Capture-Western 
49980 EP300  HDAC1 Affinity Capture-Western 
49981 EP300  SIRT2  Biochemical Activity 
49982 MIER1 CREBBP  Reconstituted Complex 
49983 SMARCA4  SIN3A Affinity Capture-Western 
49984 SMARCA4  HDAC2 Affinity Capture-Western 
49985  ESR1  NCOA6 Affinity Capture-Western 
49986  ESR1  TOP2B Affinity Capture-Western 
49987  ESR1  PRKDC Affinity Capture-Western 
49988  ESR1  PARP1 Affinity Capture-Western 
49989  ESR1  XRCC5 Affinity Capture-Western 
49990  ESR1  XRCC6 Affinity Capture-Western 
49991 PARP1  TOP2B Affinity Capture-Western 
49992 PARP1  PRKDC Affinity Capture-Western 
49993 PARP1  XRCC5 Affinity Capture-Western 
49994 PARP1  XRCC6 Affinity Capture-Western 
49995 SIRT3  XRCC6 Affinity Capture-Western 
49996 SIRT3  XRCC6  Reconstituted Complex 
49997 SIRT3  XRCC6  Biochemical Activity 
49998 HDAC1  PAX3 Affinity Capture-Western 

[49999 rows x 3 columns] 
16/09/19 12:46:11 WARN TaskSetManager: Stage 1 contains a task of very large size (1211 KB). The maximum recommended task size is 100 KB. 
[Row(src=u'MAP2K4', dst=u'FLNC', rel_type=u'Two-hybrid'), Row(src=u'MYPN', dst=u'ACTN2', rel_type=u'Two-hybrid'), Row(src=u'ACVR1', dst=u'FNTA', rel_type=u'Two-hybrid'), Row(src=u'GATA2', dst=u'PML', rel_type=u'Two-hybrid'), Row(src=u'RPA2', dst=u'STAT3', rel_type=u'Two-hybrid')] 
Traceback (most recent call last): 
    File "/Users/username/PycharmProjects/GenesAssociation/__init__.py", line 99, in <module> 
    g = gf.GraphFrame(VERTICES, EDGES) 
    File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/graphframes/graphframe.py", line 62, in __init__ 
    self._jvm_gf_api = _java_api(self._sc) 
    File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/graphframes/graphframe.py", line 34, in _java_api 
    return jsc._jvm.Thread.currentThread().getContextClassLoader().loadClass(javaClassName) \ 
    File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__ 
    File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco 
    return f(*a, **kw) 
    File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value 
py4j.protocol.Py4JJavaError: An error occurred while calling o50.loadClass. 
: java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI 
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381) 
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424) 
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:498) 
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) 
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) 
    at py4j.Gateway.invoke(Gateway.java:280) 
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) 
    at py4j.commands.CallCommand.execute(CallCommand.java:79) 
    at py4j.GatewayConnection.run(GatewayConnection.java:211) 
    at java.lang.Thread.run(Thread.java:745) 


Process finished with exit code 1 

Thank you in advance.

Answer


You can either set PYSPARK_SUBMIT_ARGS in your code:

os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 pyspark-shell" 
) 
spark = SparkSession.builder.getOrCreate() 

or you can set it through the PyCharm run configuration (Run -> Edit Configurations -> select your configuration -> Configuration tab -> Environment variables -> add PYSPARK_SUBMIT_ARGS):

[Screenshot: the PyCharm Edit Configurations dialog with the PYSPARK_SUBMIT_ARGS environment variable added]
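The value to enter is the same string the snippet above assigns; as an illustrative sketch, the environment-variable entry would look like this:

PYSPARK_SUBMIT_ARGS=--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 pyspark-shell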

You can also add the packages or jars to your spark-defaults.conf (see the sketch after the example below). With a minimal working example:

import os 
import sys 

SPARK_HOME = ... 
os.environ["SPARK_HOME"] = SPARK_HOME 
# os.environ["PYSPARK_SUBMIT_ARGS"] = ... If not set in PyCharm config 

sys.path.append(os.path.join(SPARK_HOME, "python")) 
sys.path.append(os.path.join(SPARK_HOME, "python/lib/py4j-0.10.3-src.zip")) 

from pyspark.sql import SparkSession 

spark = SparkSession.builder.getOrCreate() 

v = spark.createDataFrame([("a", "foo"), ("b", "bar"),], ["id", "attr"]) 
e = spark.createDataFrame([("a", "b", "foobar")], ["src", "dst", "rel"]) 


from graphframes import * 

g = GraphFrame(v, e) 
g.inDegrees.show() 

spark.stop() 
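For the toy graph above, g.inDegrees.show() should print a single row (id "b", inDegree 1); seeing that table come back is a quick confirmation that the JVM side of graphframes loaded correctly.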

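As for the spark-defaults.conf route, a hedged sketch, assuming the stock file location under $SPARK_HOME/conf (spark.jars.packages is the standard property for Maven coordinates):

# in $SPARK_HOME/conf/spark-defaults.conf
spark.jars.packages  graphframes:graphframes:0.2.0-spark2.0-s_2.11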


If you use Python 3 with graphframes 0.2, there is a known issue with extracting the Python library from the JAR; you have to do it manually. For example, you can download the JAR, unzip it, and make sure the root directory containing graphframes is on your Python path. This has been fixed in graphframes 0.3.
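A sketch of that manual step, assuming the JAR was fetched by --packages into the default Ivy cache (the JAR path, its file name, and the extraction target below are all assumptions; adjust them to your setup):

import os
import sys
import zipfile

# Assumed location of the JAR under the Ivy cache used by --packages
JAR = os.path.expanduser(
    "~/.ivy2/jars/graphframes_graphframes-0.2.0-spark2.0-s_2.11.jar"
)
TARGET = os.path.expanduser("~/graphframes-python")  # arbitrary extraction target

with zipfile.ZipFile(JAR) as jar:
    # The spark-packages JAR bundles its Python sources under graphframes/
    members = [m for m in jar.namelist() if m.startswith("graphframes/")]
    jar.extractall(TARGET, members)

sys.path.insert(0, TARGET)  # so `import graphframes` finds the extracted copy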


Thanks for the reply. Could you take another look at the follow-up issue? Cheers. –


Honestly, I am not sure what is going on there. It looks like the same problem as https://forums.databricks.com/questions/9530/pyspark-graphframes-init-error.html, but I cannot reproduce it. – zero323


No worries, man. I got it running by simply building the package and copying the graphframes folder into the pyspark directory (including the pyc files). Thanks anyway! –
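For anyone wanting to reproduce that workaround, a rough sketch (both paths are assumptions for wherever you built graphframes and unpacked Spark; note that the graphframes JAR itself still has to reach the driver's classpath, e.g. via --jars or PYSPARK_SUBMIT_ARGS, for the JVM class to load):

import shutil

# Assumed paths: a local graphframes build and the Spark distribution
SRC = "/path/to/graphframes/python/graphframes"
SPARK_HOME = "/path/to/spark-2.0.0-bin-hadoop2.7"

# Copy the Python package into Spark's bundled pyspark tree so that
# `from pyspark import graphframes` resolves without --packages
shutil.copytree(SRC, SPARK_HOME + "/python/pyspark/graphframes")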