Scraping data from an online CGI site with R

Goal: obtain a full year of tide prediction data for a specific tide station (see the example below).

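What I tried: the closest matches I found are posts with tips for scraping weather data, such as this exchange. I noticed that the site holding the data I want is CGI-based: when I select parameters in the form, they are not reflected in the link address. I am completely new to handling this kind of site for data scraping.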

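library(RCurl)
url <- "http://tbone.biol.sc.edu/tide/tideshow.cgi?site=South+Beach%2C+Yaquina+Bay%2C+Oregon&units=f"

# download the page source as a single string
s <- getURL(url)
# strip HTML line breaks (gsub needs pattern, replacement, and input)
s <- gsub("<br>\n", "", s)
# read the remaining text as if it were CSV, then release the connection
dat <- read.csv(con <- textConnection(s))
close(con)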

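This is the first code that actually gave me any output, but the output is not the data from the table. Ideally I would like to choose options (for example, a span of one year with the start date set to January 1). I have never done this before, and I do not know enough about HTML programming or web development to know which tools to use for this type of site.

Source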

2017-12-15 KVininska

That site really doesn't want anyone scraping it (http://tbone.biol.sc.edu/robots.txt). However, [the software it is displaying](http://www.flaterco.com/xtide/index.html) is free. – alistaire

https://cran.r-project.org/web/packages/TideTables/index.html & https://beckmw.wordpress.com/2017/04/12/predicting-tides-in-r/ & http://lukemiller.org/index.php/2016/09/rtide-ar-package-for-predicting-tide-heights-us-locations-only-currently/ & ... OH, especially https://github.com/poissonconsulting// – hrbrmstr

I have seen these links. I am not interested in predicting tides myself; this site does that for me. The site matters because it is the one a "typical harvester" would use when planning a trip. I was able to export the data as CSV, but the goal was to collect the information directly from the site. – KVininska

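With help from a colleague, here is code that scrapes data for multiple sites from a GUI-based .cgi site according to several criteria.

The code goes back to the main web page where multiple sites are listed as hyperlinks, selects the ones needed, applies the criteria that would otherwise be chosen in the GUI, and then formats the results into a data frame.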
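
Before the full script, the core idea of "applying the GUI criteria" can be sketched in isolation: the form fields are simply written into the query string of the tideshow.cgi address. The helper name `build_tide_url` and the reduced parameter set are assumptions made for illustration; whether the server accepts fewer fields than the full list used in the script below has not been verified.

```r
# Hypothetical helper (not part of the original answer): collapse a named
# list of form fields into the semicolon-separated query string that the
# tideshow.cgi links use.
build_tide_url <- function(site, year, month, day, glen,
                           base = "http://tbone.biol.sc.edu/tide/tideshow.cgi") {
  # Reduced parameter set; an assumption, the full script sends many more.
  params <- list(type = "table", caltype = "ndp", glen = glen, units = "feet",
                 year = year, month = month, day = day, site = site)
  query <- paste0(names(params), "=", unlist(params), collapse = ";")
  paste0(base, "?", query)
}

# The site name is assumed to be URL-encoded already, as in the page links.
u <- build_tide_url("South+Beach%2C+Yaquina+Bay%2C+Oregon",
                    year = "2016", month = "01", day = "01", glen = 366)
print(u)
```

Keeping the fields in a named list makes it easy to change the start date or the number of days without editing a long pasted string.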

library(rvest) 
library(plyr) 
library(dplyr) 
library(stringr) 

#define base url for region (ie site where multiple locations are) 
url <- "http://tbone.biol.sc.edu/tide/sites_uswest.html" 

#read html from page and extract hyperlinks 
#view url to see list of links for multiple locations 
l <- url %>%read_html()%>% 
    html_nodes("a") %>% html_attr("href") 

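# keep only the links containing "tideshow" to get a vector of site links
# grep allows filtering/subsetting using a partial string
# (a plain substring match; the leading "*" is not valid regex syntax)
sites <- l[grep("tideshow", l, fixed = TRUE)] 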

# remove everything before 'site=' to get correct formatting for url site names 
sites <- gsub(".*site=", "", sites) 

#generate vector of sites of interest 
#don't need to use regex to create the vector; 
    #you can manipulate the list of sites however you prefer 
    #here, used | for "or" value for selecting multiple sites at once 
sites <- sites[grep("(Waldport\\%2C\\+Alsea|South\\+Beach\\%2C\\+Yaquina|Charleston\\%2C\\+Oregon)(?!.*\\%282\\%29)", sites, perl=TRUE)] 

#define starting date of data 
year <- "2016" 
month <- "01" 
day <- "01" 

#define number of days for prediction 
numberofdays = 366 +365 #no. of days in 2016 + no. days in 2017 

# lapply through the site vector, x represents site. 
# This will pull data from each site in the vector "sites", and bind it together in a list 
o <- lapply(sites, function(x){ 

    # paste together file path using generalized cgi address and defined parameters 
    path<- paste0("http://tbone.biol.sc.edu/tide/tideshow.cgi?type=table;tplotdir=horiz;gx=640;gy=240;caltype=ndp;interval=00%3A01;glen=", 
       numberofdays , 
       ";fontsize=%2B0;units=feet;", 
       "year=", year, ";month=", month, ";day=", day, 
       ";hour=00;min=01;", 
       "killsun=1;tzone=local;ampm24=24;colortext=black;colordatum=white;colormsl=yellow;colortics=red;colorday=skyblue;colornight=deep-%3Cbr%20%2F%3Eskyblue;colorebb=seagreen;colorflood=blue;site=", 
       x, 
       ";d_year=;d_month=Jan;d_day=01;d_hour=00;d_min=00" 
) 

    # use readLines to bring in the page source (including the table) for each site 
    d <- readLines(path, warn=FALSE) 

    # extract site name 
    site <- str_extract(string = d[grep("<h2>", d)][1], pattern = "(?<=<h2>)(.*?)(?=</h2>)") 

    # extract coordinates 
    coord <- gsub(".*<pre>", "", d[grep("<h2>", d)][1]) 

    # get tide data lines 
    data <- d[grep("\\d{4}[-]\\d{1,2}[-]\\d{1,2}", d) ] 

    # bind columns together 
    all <- cbind(site,coord, data) 
}) 

# bind data.frame from list 
df <- ldply(o, rbind.data.frame) 

# bind site and coordinate columns with split data columns 
tides <- cbind(df[c(1,2)] , str_split_fixed(df$data, "\\s+", 6)) 
names(tides) <- c("site", "coordinates", "date", "time", "tz", "depth", "units", "tide") 
head(tides) 
str(tides) 
summary(tides)
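
After the split, every column is still text. A minimal sketch of the follow-up step, converting the depth to numeric and the date and time to a timestamp, is shown here on made-up sample lines. The line layout is inferred from the column names above (date, time, tz, depth, units, tide), so the real page output may differ, and the helper `split_fixed` is a hypothetical base-R stand-in for `str_split_fixed`.

```r
# Made-up sample lines in the layout implied by the column names above;
# these values are illustrative, not real predictions.
lines <- c(
  "2016-01-01 03:07 PST 7.35 feet High Tide",
  "2016-01-01 09:12 PST 3.10 feet Low Tide"
)

# Hypothetical helper: split on whitespace into exactly n fields, keeping
# any extra words (e.g. "High Tide") together in the last field.
split_fixed <- function(x, n = 6) {
  parts <- strsplit(trimws(x), "\\s+")
  t(vapply(parts, function(p) {
    if (length(p) > n) {
      p <- c(p[seq_len(n - 1)], paste(p[n:length(p)], collapse = " "))
    }
    length(p) <- n  # pad short rows with NA
    p
  }, character(n)))
}

m <- split_fixed(lines)

tides_typed <- data.frame(
  # Clock time is parsed as if it were UTC only to get a sortable timestamp;
  # the reported zone is kept in its own column.
  datetime = as.POSIXct(paste(m[, 1], m[, 2]),
                        format = "%Y-%m-%d %H:%M", tz = "UTC"),
  tz       = m[, 3],
  depth    = as.numeric(m[, 4]),
  units    = m[, 5],
  tide     = m[, 6],
  stringsAsFactors = FALSE
)
str(tides_typed)
```

With a numeric depth and a proper timestamp, the data can be sorted, filtered by date range, or plotted directly.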

Source

2017-12-19 16:37:57 KVininska
