How to process JSON queries quickly for 1.2 million rows in an R data frame

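I have about 6.4 million check-ins from Gowalla. These check-ins cover 1.28 million unique locations. However, Gowalla only provides latitudes and longitudes, so I need to find the city, state and country for each of those lat/long pairs. With help from another post on Stack Overflow, I was able to write the R query below, which queries OpenStreetMap and retrieves the relevant geographical details.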

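Unfortunately, it takes about 1 minute to process 125 rows, which means 1.28 million rows would take several days. Is there a faster way to find these details? Perhaps there is a package with built-in latitudes and longitudes of the world's cities that can return the city name for a given lat/long, so that I don't have to query an online service.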

The venue table is a data frame with 3 columns: 1. vid (venueId) 2. lat (latitude) 3. long (longitude)
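As a rough illustration of how such a table relates to the raw check-ins, the sketch below collapses a check-in log to its unique venues and later copies per-venue results back onto every check-in. The `checkins` data frame, its values, and the placeholder `country` column are invented for the example; only the column names `vid`, `lat` and `long` come from the description above.

```r
# Hypothetical check-in log: several check-ins can share a venue.
checkins <- data.frame(
  vid  = c(1, 2, 1, 3, 2),
  lat  = c(12.53, -16.31, 12.53, 42.87, -16.31),
  long = c(-70.03, -48.95, -70.03, 74.59, -48.95)
)

# One row per venue: only these rows need to be geocoded.
venueTable <- unique(checkins[, c("vid", "lat", "long")])

# Placeholder for whatever the geocoding step produces (assumed values).
venueTable$country <- c("Aruba", "Brazil", "Kyrgyzstan")

# Copy the per-venue result back onto every check-in by matching on vid.
checkins$country <- venueTable$country[match(checkins$vid, venueTable$vid)]
```

Geocoding the unique venues once and joining back is what keeps the workload at 1.28 million lookups rather than 6.4 million.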

library(jsonlite)  # provides fromJSON()

# pre-allocate the result columns
venueTable$display_name <- NA_character_
venueTable$country <- NA_character_

for (i in seq_len(nrow(venueTable))) {
  # progress indicator: print the current value of i on screen
  cat(paste(".", i, "."))

  # compose the reverse-geocoding query URL
  url <- paste0("http://nominatim.openstreetmap.org/reverse.php?format=json&lat=",
                venueTable$lat[i],
                "&lon=",
                venueTable$long[i])
  x <- fromJSON(url)
  venueTable$display_name[i] <- x$display_name
  venueTable$country[i] <- x$address$country
}

I am using the jsonlite package in R, which returns the result of the JSON query as x, a structure that stores the various fields returned. So I pick out the fields I need with x$display_name or x$address$city.
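One practical wrinkle with this kind of field access: if a response lacks a field (for example, no city in the address), `x$address$city` is NULL, and assigning NULL into a single element of a column raises an error. The helper below is a minimal sketch of a defensive lookup; `get_field` is a hypothetical name, and the mock response only imitates the fields named in this question rather than real API output.

```r
# Walk a nested list by name; return NA instead of failing when a key is absent.
get_field <- function(x, path) {
  for (key in path) {
    if (!is.list(x) || is.null(x[[key]])) return(NA_character_)
    x <- x[[key]]
  }
  as.character(x)
}

# Mock response (assumed shape, not real API output): the address has no city.
x <- list(display_name = "Somewhere, Aruba",
          address = list(country = "Aruba"))

name    <- get_field(x, "display_name")
country <- get_field(x, c("address", "country"))
city    <- get_field(x, c("address", "city"))   # NA, because the field is missing
```

Inside a loop, `venueTable$city[i] <- get_field(x, c("address", "city"))` would then store NA for such rows instead of stopping.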

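My laptop is a Core i5 3230M with 8 GB of RAM and a 120 GB SSD, running Windows 8.

Source

2016-04-10 Asad Feroz Ali

I don't know how to do this in R, but I assume your code is synchronous, meaning only one HTTP request is sent at a time. If you could send, say, 10 at a time, you might get a ~5x speedup. – Dodekeract

You are going to have problems even if you have the time and patience. The service you are querying allows "an absolute maximum of one request per second", which you are already exceeding. They are likely to throttle your requests before you reach 1.2 million queries. Their website notes that similar APIs for larger-scale use offer only around 15k free requests per day.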
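To put those limits in perspective, here is the back-of-the-envelope arithmetic using only figures quoted in this thread (1.28 million locations, 125 rows per minute, one request per second, 15k requests per day). The variable names are made up for the sketch.

```r
n_locations <- 1.28e6                            # unique locations, from the question

# Observed throughput in the question: 125 rows per minute.
days_observed <- n_locations / 125 / 60 / 24     # about 7.1 days

# Stated ceiling of one request per second.
days_at_limit <- n_locations / (60 * 60 * 24)    # about 14.8 days

# A quota of 15,000 requests per day.
days_at_quota <- n_locations / 15000             # about 85.3 days

estimates <- round(c(observed = days_observed,
                     limit    = days_at_limit,
                     quota    = days_at_quota), 1)
print(estimates)
```

Even at the fastest of these rates the job takes about a week, which is the main argument for an offline lookup.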

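You would be much better off using an offline option. A quick search shows that there are many freely available datasets of populated places, along with their longitudes and latitudes. Here is one we will use: http://simplemaps.com/resources/world-cities-data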

> library(dplyr) 

> cities.data <- read.csv("world_cities.csv") %>% tbl_df 
> print(cities.data) 

Source: local data frame [7,322 x 9] 

      city  city_ascii  lat  lng pop  country iso2 iso3 province 
      (fctr)   (fctr) (dbl) (dbl) (dbl)  (fctr) (fctr) (fctr) (fctr) 
1 Qal eh-ye Now  Qal eh-ye 34.9830 63.1333 2997 Afghanistan  AF AFG Badghis 
2  Chaghcharan Chaghcharan 34.5167 65.2500 15000 Afghanistan  AF AFG  Ghor 
3  Lashkar Gah Lashkar Gah 31.5830 64.3600 201546 Afghanistan  AF AFG Hilmand 
4   Zaranj   Zaranj 31.1120 61.8870 49851 Afghanistan  AF AFG Nimroz 
5  Tarin Kowt  Tarin Kowt 32.6333 65.8667 10000 Afghanistan  AF AFG Uruzgan 
6 Zareh Sharan Zareh Sharan 32.8500 68.4167 13737 Afghanistan  AF AFG Paktika 
7  Asadabad  Asadabad 34.8660 71.1500 48400 Afghanistan  AF AFG Kunar 
8   Taloqan  Taloqan 36.7300 69.5400 64256 Afghanistan  AF AFG Takhar 
9 Mahmud-E Eraqi Mahmud-E Eraqi 35.0167 69.3333 7407 Afghanistan  AF AFG Kapisa 
10  Mehtar Lam  Mehtar Lam 34.6500 70.1667 17345 Afghanistan  AF AFG Laghman 
..   ...   ...  ...  ... ...   ... ... ...  ...

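It is hard to demonstrate without actual example data (which would be helpful to provide!), but we can make up some toy data.

# make up toy data 
> candidate.longlat <- data.frame(vid = 1:3, 
           lat = c(12.53, -16.31, 42.87), 
           long = c(-70.03, -48.95, 74.59))

Using the distm function from geosphere, we can compute the distances between all of your data points and all of the city locations at once. For you, this produces a matrix of roughly 9 billion numbers (1.28 million locations × 7,322 cities), about 75 GB as 8-byte doubles, so it may take a while and will not fit in 8 GB of RAM in one piece; work through the points in chunks, and consider parallelisation.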

> install.packages("geosphere") 
> library(geosphere) 

# compute distance matrix using geosphere 
> distance.matrix <- distm(x = candidate.longlat[,c("long", "lat")], 
         y = cities.data[,c("lng", "lat")])

It is then easy to find the closest city to each of your data points and cbind it to your data.frame.

# work out which index in the matrix is closest to the data 
> closest.index <- apply(distance.matrix, 1, which.min) 

# cbind city and country of match with original query 
> candidate.longlat <- cbind(candidate.longlat, cities.data[closest.index, c("city", "country")]) 
> print(candidate.longlat) 

    vid lat long  city country 
1 1 12.53 -70.03 Oranjestad  Aruba 
2 2 -16.31 -48.95 Anapolis  Brazil 
3 3 42.87 74.59 Bishkek Kyrgyzstan
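Because the full distance matrix for 1.28 million points is too large to hold in memory, one option is to process the points in chunks and keep only the index of the nearest city for each chunk. The sketch below does this in base R with a hand-written haversine formula swapped in for `distm`, so that it runs without extra packages. The function names `haversine_matrix` and `nearest_city_index`, the chunk size, and the three-row city table with approximate coordinates are all assumptions made for illustration.

```r
# Great-circle distances (metres) between every point and every city.
haversine_matrix <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  phi1 <- lat1 * to_rad
  phi2 <- lat2 * to_rad
  dphi <- outer(phi1, phi2, "-")
  dlam <- outer(lon1 * to_rad, lon2 * to_rad, "-")
  h <- sin(dphi / 2)^2 + outer(cos(phi1), cos(phi2)) * sin(dlam / 2)^2
  2 * r * asin(pmin(sqrt(h), 1))   # pmin guards against rounding above 1
}

# Nearest-city row index for each point, computed chunk by chunk so that
# only a chunk_size x nrow(cities) matrix exists at any one time.
nearest_city_index <- function(points, cities, chunk_size = 10000) {
  n <- nrow(points)
  if (n == 0) return(integer(0))
  out <- integer(n)
  for (s in seq(1, n, by = chunk_size)) {
    e <- min(s + chunk_size - 1, n)
    d <- haversine_matrix(points$long[s:e], points$lat[s:e],
                          cities$lng, cities$lat)
    out[s:e] <- max.col(-d, ties.method = "first")  # row-wise which.min
  }
  out
}

# Toy points from the answer, plus a tiny illustrative city table
# (approximate coordinates, deliberately in a different order).
candidate.longlat <- data.frame(vid = 1:3,
                                lat = c(12.53, -16.31, 42.87),
                                long = c(-70.03, -48.95, 74.59))
toy.cities <- data.frame(city = c("Bishkek", "Oranjestad", "Anapolis"),
                         lat  = c(42.87, 12.52, -16.33),
                         lng  = c(74.59, -70.03, -48.95))

idx <- nearest_city_index(candidate.longlat, toy.cities, chunk_size = 2)
result <- cbind(candidate.longlat, city = toy.cities$city[idx])
print(result)
```

With a chunk of 10,000 points against 7,322 cities, each intermediate matrix holds about 73 million doubles (under 600 MB), though the intermediate `outer` results multiply that, so a smaller chunk may be needed on an 8 GB machine.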

Source

2016-04-10 09:17:19

That is quite an explanation! Thank you for such a detailed response.

That worked! Thanks a ton.

Does the distance function default to 'distm' when none is specified?

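Here is another way, using R's built-in spatial processing capabilities. It takes about a minute (on my system) for ~7,500 points, so you are looking at a couple of hours rather than a day or more: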

library(sp) 
library(rgeos) 
library(rgdal) 

# world places shapefile 
URL1 <- "http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_populated_places.zip" 
fil1 <- basename(URL1) 
if (!file.exists(fil1)) download.file(URL1, fil1) 
unzip(fil1) 

places <- readOGR("ne_10m_populated_places.shp", "ne_10m_populated_places", 
        stringsAsFactors=FALSE) 

# some data from the other answer since you didn't provide any 
URL2 <- "http://simplemaps.com/resources/files/world/world_cities.csv" 
fil2 <- basename(URL2) 
if (!file.exists(fil2)) download.file(URL2, fil2) 

# we need the points from said dat 
dat <- read.csv(fil2, stringsAsFactors=FALSE) 
pts <- SpatialPoints(dat[,c("lng", "lat")], CRS(proj4string(places))) 

# this is not necessary 
# I just don't like the warning about longlat not being a real projection 
robin <- "+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs" 
pts <- spTransform(pts, robin) 
places <- spTransform(places, robin) 

# compute the distance (makes a pretty big matrix so you should do this 
# in chunks unless you have a ton of memory or do it row-by-row 
far <- gDistance(pts, places, byid=TRUE) 

# find the closest one 
closest <- apply(far, 1, which.min) 

# map to the fields (you may want to map to other fields) 
locs <- places@data[closest, c("NAME", "ADM1NAME", "ISO_A2")] 

locs[sample(nrow(locs), 10),] 

##    NAME  ADM1NAME ISO_A2 
## 3274  Szczecin West Pomeranian  PL 
## 1039  Balakhna  Nizhegorod  RU 
## 1012  Chitre   Herrera  PA 
## 3382  L'Aquila   Abruzzo  IT 
## 1982  Dothan   Alabama  US 
## 5159 Bayankhongor  Bayanhongor  MN 
## 620  Deming  New Mexico  US 
## 1907 Fort Smith  Arkansas  US 
## 481  Dedougou  Mou Houn  BF 
## 7169  Prague   Prague  CZ

You could run this in parallel and finish within an hour.
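The timing quoted in this answer (about a minute for ~7,500 points) can be scaled to the full data set as a rough check on those estimates. The worker count and the assumption of perfect linear speed-up are illustrative only; real parallel runs carry overhead.

```r
n_locations    <- 1.28e6   # unique locations, from the question
points_per_min <- 7500     # throughput quoted in this answer

# Single process: time grows linearly with the number of points.
hours_serial <- n_locations / points_per_min / 60               # about 2.8 hours

# Assumption: 4 workers and perfect linear speed-up (optimistic).
n_workers <- 4
minutes_parallel <- n_locations / points_per_min / n_workers    # about 43 minutes
```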

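For better place resolution, you could use a very lightweight shapefile of country or admin 1 polygons, then use a second process to compute the distance from higher-resolution points for those geographic locations.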

Source

2016-04-10 13:11:39 hrbrmstr
