2016-04-10 20 views
1

R: Extracting times from an SRT (subtitle) file

I need to calculate the speech rate of each line of an SRT file. The content of an SRT (subtitle) file looks like this:

1 
00:00:19,000 --> 00:00:21,989 
I'm Annita McVeigh and welcome to Election Today where we'll bring you 

2 
00:00:22,000 --> 00:00:23,989 
the latest from the campaign trail, plus debate and analysis. 

3 
00:00:24,000 --> 00:00:28,989 
The Liberal Democrats promise to protect the pay of millions 

For example, it takes 4 seconds 989 milliseconds to say the 10 words "The Liberal Democrats promise to protect the pay of millions", so the average speech rate for these 10 words is 498.9 milliseconds per word. How can I read in the SRT file so that I get a data frame with one row per subtitle and startTime, endTime, textString, and wordCount as columns, like the following?

startTime <- c("00:00:19,000", "00:00:22,000", "00:00:24,000")
endTime <- c("00:00:21,989", "00:00:23,989", "00:00:28,989")
textString <- c("I'm Annita McVeigh and welcome to Election Today where we'll bring you", "the latest from the campaign trail, plus debate and analysis.", "The Liberal Democrats promise to protect the pay of millions")
wordCount <- c(12, 10, 10)
rate.df <- data.frame(startTime, endTime, textString, wordCount)

And how can I subtract startTime from endTime in R when the times are given in the form hours:minutes:seconds,milliseconds?

+0

I managed to do this task in MS Excel, but I have too much data to use Excel for this job. – Ninjacat

Answers

2

Here is a possible solution (the code is fairly self-explanatory):

text=" 

1 
00:00:19,000 --> 00:00:21,989 
I'm Annita McVeigh and welcome to Election Today where we'll bring you 

2 
00:00:22,000 --> 00:00:23,989 
the latest from the campaign trail, 
plus debate 
and analysis. 



3 
00:00:24,000 --> 00:00:28,989 
The Liberal Democrats promise to protect 
the pay of millions" 

con <- textConnection(text)
lines <- readLines(con)
close(con)

# the previous lines of code only replicate your case; with a real file
# they should be replaced by the following single line
# lines <- readLines(srtFileName)

# a new block starts at every blank line; each block becomes one row
listOfEntries <-
  lapply(split(seq_along(lines), cumsum(grepl("^\\s*$", lines))), function(blockIdx) {
    block <- lines[blockIdx]
    block <- block[!grepl("^\\s*$", block)]  # drop the blank separator lines
    if (length(block) == 0) {
      return(NULL)
    }
    if (length(block) < 3) {
      warning("a block not respecting srt standards has been found")
    }
    # block[-(1:2)] is every line after the id and the time range
    return(data.frame(id = block[1],
                      times = block[2],
                      textString = paste0(block[-(1:2)], collapse = "\n"),
                      stringsAsFactors = FALSE))
  })
m <- do.call(rbind, listOfEntries)

# split start and end times
tmp <- do.call(rbind, strsplit(m[, 'times'], ' --> '))
m$startTime <- tmp[, 1]
m$endTime <- tmp[, 2]

# parse times: hours, minutes, seconds, milliseconds -> seconds
tmp <- do.call(rbind, lapply(strsplit(m$startTime, ':|,'), as.numeric))
m$fromSeconds <- drop(tmp %*% c(60 * 60, 60, 1, 1 / 1000))

tmp <- do.call(rbind, lapply(strsplit(m$endTime, ':|,'), as.numeric))
m$toSeconds <- drop(tmp %*% c(60 * 60, 60, 1, 1 / 1000))

# compute time difference in seconds
m$timeDiffInSecs <- m$toSeconds - m$fromSeconds

# word count: number of runs of non-word characters, plus one
m$wordCount <- vapply(gregexpr("\\W+", m$textString), length, 0) + 1

# or, if you consider "I'm" a single word, you can remove the occurrences of ' first, e.g.:
# m$wordCount <- vapply(gregexpr("\\W+", gsub("'", "", m$textString)), length, 0) + 1

m$millisecsPerWord <- m$timeDiffInSecs * 1000 / m$wordCount

Result:

> m
  id                         times                                                             textString
2  1 00:00:19,000 --> 00:00:21,989 I'm Annita McVeigh and welcome to Election Today where we'll bring you
3  2 00:00:22,000 --> 00:00:23,989      the latest from the campaign trail, \nplus debate \nand analysis.
6  3 00:00:24,000 --> 00:00:28,989         The Liberal Democrats promise to protect \nthe pay of millions
     startTime      endTime fromSeconds toSeconds timeDiffInSecs wordCount millisecsPerWord
2 00:00:19,000 00:00:21,989          19    21.989          2.989        14         213.5000
3 00:00:22,000 00:00:23,989          22    23.989          1.989        11         180.8182
6 00:00:24,000 00:00:28,989          24    28.989          4.989        10         498.9000
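
Note that the wordCount values in the result above (14, 11, 10) differ from the 12, 10, 10 given in the question: the "\\W+" pattern splits "I'm" and "we'll" at the apostrophe and also counts the trailing full stop of the second subtitle as a separator. As a minimal sketch of an alternative, assuming that "word" means a whitespace-delimited token, the two steps can be wrapped in small helpers. The names srt_to_seconds and count_words are hypothetical, introduced here only for illustration, and the input is the data from the question.

```r
# Hypothetical helper: convert SRT timestamps ("HH:MM:SS,mmm") to seconds.
srt_to_seconds <- function(x) {
  # split on ':' and ',' to get hours, minutes, seconds, milliseconds
  parts <- do.call(rbind, lapply(strsplit(x, "[:,]"), as.numeric))
  drop(parts %*% c(60 * 60, 60, 1, 1 / 1000))
}

# Hypothetical helper: count whitespace-delimited words
# (an assumption about what counts as a word).
count_words <- function(x) {
  lengths(strsplit(trimws(x), "\\s+"))
}

# Data taken from the question
startTime <- c("00:00:19,000", "00:00:22,000", "00:00:24,000")
endTime <- c("00:00:21,989", "00:00:23,989", "00:00:28,989")
textString <- c(
  "I'm Annita McVeigh and welcome to Election Today where we'll bring you",
  "the latest from the campaign trail, plus debate and analysis.",
  "The Liberal Democrats promise to protect the pay of millions"
)

rate.df <- data.frame(startTime, endTime, textString, stringsAsFactors = FALSE)
rate.df$wordCount <- count_words(rate.df$textString)
rate.df$durationSecs <- srt_to_seconds(rate.df$endTime) - srt_to_seconds(rate.df$startTime)
rate.df$millisecsPerWord <- rate.df$durationSecs * 1000 / rate.df$wordCount
```

With this counting rule the word counts come out as 12, 10, 10, and the third subtitle gives 498.9 milliseconds per word, matching the figure in the question. Because "\\s+" also matches newlines, the same helper works on the multi-line textString values built by the code above.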
+1

Oh, that's great! Thank you so much, digEmAll! The code is just beautiful! – Ninjacat

+0

Thanks, @digemall – Ninjacat
