Webscraping in R - IMDb ETL Showcase

Web scraping in R works as an ETL pipeline: it mines web data by reading HTML tags and converting them into a structured format that can easily be visualized with the tidyverse. Let's scrape movies from IMDb into a data frame in R using the rvest library, and then visualize the data frame using ggplot2 and its qplot function.

Importing the key R libraries:

    library(rvest)    # scraping
    library(dplyr)    # piping
    library(ggplot2)  # plotting

Specifying the URL of the website to be scraped:

    url <- 'http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature'

Reading the HTML code from the website:

    webpage <- read_html(url)

Using CSS selectors to scrape the rankings section:

    rank_data_html <- html_nodes(webpage, '.text-primary')

Converting the ranking data to text:

    rank_data <- html_text(rank_data_html)

Let's have a look at the rankings:

    head(rank_data)
    [1] "1." "2." "3." "4."
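The select-then-extract steps can also be tried offline on a small inline HTML fragment, which helps when checking a CSS selector before running it against the live page. The following is a minimal sketch, not the post's own code: the HTML fragment and the names sample_html, sample_page, rank_nodes, rank_text and rank_num are made up for illustration, and it assumes the rankings arrive as strings of the form "1.".

```r
library(rvest)

# Hypothetical HTML fragment standing in for the IMDb results page
sample_html <- '<html><body>
  <div class="lister-item"><span class="text-primary">1.</span></div>
  <div class="lister-item"><span class="text-primary">2.</span></div>
  <div class="lister-item"><span class="text-primary">3.</span></div>
</body></html>'

# Extract: parse the HTML string into a document
sample_page <- read_html(sample_html)

# Select every node carrying the CSS class "text-primary"
rank_nodes <- html_nodes(sample_page, ".text-primary")

# Transform: pull the text out of the nodes, then convert to numbers
rank_text <- html_text(rank_nodes)
rank_num  <- as.numeric(rank_text)  # the trailing dot is accepted by as.numeric

print(rank_text)
print(rank_num)
```

Converting the rankings to numeric is one plausible next step before building a data frame for plotting. In rvest 1.0.0 and later, html_elements is the newer name for html_nodes; both select the same nodes here.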