Chapter 24: Web scraping

Author

Andrew Pua

Learning objectives

  1. Learn a bit of HTML
  2. Use that bit of HTML to learn web scraping

Curious features of the chapter

  • No exercises!
  • Ethical implications of web scraping
  • Using an external tool like SelectorGadget
  • Limitations of scraping for dynamic websites

Typical HTML structure

  • HTML has hierarchical structure.
  • Structure is composed of elements.
  • Each element has
    • Start tag
    • Attributes
    • Content
    • End tag
  • Each element can have children which are themselves elements.
  • Consistency of structure enables one to do web scraping.
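  • A minimal sketch of this structure, assuming rvest is installed. The page content, the id "films", the class "film" and the attribute data-year are made up for illustration; minimal_html() builds a small document from a string so no website is contacted.

```r
library(rvest)

# A hypothetical page: one <ul> element with two <li> children.
page <- minimal_html('
  <ul id="films">
    <li class="film" data-year="1972">The Godfather</li>
    <li class="film" data-year="1994">The Shawshank Redemption</li>
  </ul>
')

# The <li> elements are children of the <ul> element.
films <- page |> html_element("ul") |> html_children()

html_name(films)               # tag name of each child element
html_attr(films, "data-year")  # value of an attribute in the start tag
html_text2(films)              # content between the start and end tags
```

  • Because both children share the same tag, class and attribute layout, one selector retrieves them all. That regularity is what scraping relies on.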

Key commands to extract data from HTML

  • Key package: rvest
  • Load html to scrape: read_html()
  • Extract data through selectors
    • html_elements(), html_element()
    • html_attr()
    • html_text2()
  • Extract data from HTML tables: html_table()
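  • The commands above can be tried on a small made-up page before touching a real website. This is only a sketch: the class names "movie" and "rating", the links and the table contents are all hypothetical.

```r
library(rvest)

# Hypothetical page: two "movie" blocks (the second has no rating) and a table.
page <- minimal_html('
  <div class="movie"><a href="/title/a">Film A</a> <span class="rating">9.0</span></div>
  <div class="movie"><a href="/title/b">Film B</a></div>
  <table>
    <tr><th>Rank</th><th>Title</th></tr>
    <tr><td>1</td><td>Film A</td></tr>
    <tr><td>2</td><td>Film B</td></tr>
  </table>
')

# html_elements() returns every match of the CSS selector.
movies <- page |> html_elements(".movie")

# html_element() returns exactly one result per input element,
# so a movie without a rating yields NA instead of being dropped.
links <- movies |> html_element("a") |> html_attr("href")
ratings_txt <- movies |> html_element(".rating") |> html_text2()

# html_table() turns a <table> element into a tibble.
tbl <- page |> html_element("table") |> html_table()
```

  • Pairing html_elements() for the observations with html_element() for each variable keeps the resulting vectors the same length, which makes it easier to assemble them into a data frame.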

Example: Loading IMDB data

  • Load packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rvest)

Attaching package: 'rvest'

The following object is masked from 'package:readr':

    guess_encoding
  • The next code chunk may be a bottleneck for people with slow internet connections, and it may fail on websites that guard against scraping.
# Reading the URL directly, as in the book, can fail (for example, with a timeout)
url <- "https://web.archive.org/web/20220201012049/https://www.imdb.com/chart/top/"
html <- read_html(url)
  • Download the HTML file locally instead, then call read_html() on the downloaded file.
# Suggestion from https://stackoverflow.com/questions/33295686/rvest-error-in-open-connectionx-rb-timeout-was-reached
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
html.book <- read_html("scrapedpage.html")
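  • One possible refinement, sketched under assumptions: download a page only when no local copy exists yet, and catch a failed download instead of stopping the script. The helper name fetch_once() is hypothetical and is not part of rvest or the book; it uses only base R.

```r
# Hypothetical helper (not from the book): download a page once, reuse the
# local copy afterwards, and return NULL instead of stopping on failure.
fetch_once <- function(url, destfile) {
  if (file.exists(destfile)) {
    return(destfile)  # already downloaded; no new request is made
  }
  ok <- tryCatch(
    {
      download.file(url, destfile = destfile, quiet = TRUE)
      TRUE
    },
    error = function(e) FALSE,
    warning = function(w) FALSE
  )
  if (!ok) {
    unlink(destfile)  # remove any partial file left by a failed download
    return(NULL)
  }
  destfile
}

# Possible usage (commented out so the sketch runs without a network):
# path <- fetch_once(url, "scrapedpage.html")
# if (!is.null(path)) html.book <- rvest::read_html(path)
```

  • Reusing the local file also avoids sending repeated requests to the same server while you experiment with selectors.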
  • Specific to the Internet Archive: a snapshot may be listed as a link without actually being accessible.

  • You may encounter a 403 Forbidden error.

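# Not just any snapshot URL will work; this one may return 403 Forbidden.
# A separate destfile keeps the earlier scrapedpage.html from being overwritten.
url <- "https://web.archive.org/web/20240223185506/https://www.imdb.com/chart/top/"
download.file(url, destfile = "scrapedpage-2024.html", quiet=TRUE)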
  • But even if the download succeeds, the scraped file may be part of a dynamic website.
  • A very old snapshot is available for comparison.
url <- "https://web.archive.org/web/20040704034814/https://www.imdb.com/chart/top/"
download.file(url, destfile = "scrapedpage-old.html", quiet=TRUE)
  • But the structure of the website and the associated HTML have changed.
  • This means the code you see in the book will not work out of the box.
html.old <- read_html("scrapedpage-old.html")
temp <- html.old |> 
  html_elements("table") |> html_attr("border") 
which(temp == "1")
[1] 21
temp <- html.old |> 
  html_elements("table") 
table.old <- html_table(temp[21], header=TRUE)
ratings <- table.old[[1]] |>
  select(
    rank = "Rank",
    title_year = "Title",
    rating = "Rating",
    votes = "Votes"
  ) |> 
  mutate(votes = parse_number(votes)) |>
  separate_wider_regex(
    title_year,
    patterns = c(
      title = ".+", " +\\(",
      year = "\\d+", "\\)"
    )
  )
ratings
# A tibble: 250 × 5
    rank title                                              year  rating  votes
   <dbl> <chr>                                              <chr>  <dbl>  <dbl>
 1     1 Godfather, The                                     1972     9    97616
 2     2 Shawshank Redemption, The                          1994     8.9 120632
 3     3 Lord of the Rings: The Return of the King, The     2003     8.9  66781
 4     4 Godfather: Part II, The                            1974     8.8  57761
 5     5 Schindler's List                                   1993     8.7  82282
 6     6 Shichinin no samurai                               1954     8.7  24675
 7     7 Casablanca                                         1942     8.7  55763
 8     8 Lord of the Rings: The Two Towers, The             2002     8.7  88323
 9     9 Lord of the Rings: The Fellowship of the Ring, The 2001     8.7 130513
10    10 Star Wars                                          1977     8.7 114510
# ℹ 240 more rows
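  • The same steps can be rehearsed on a toy page. The sketch below is not the code used above: it assumes the border attribute identifies the table uniquely, so a CSS attribute selector can replace the which() lookup, and the page content is made up for illustration.

```r
library(rvest)
library(dplyr)
library(tidyr)

# Hypothetical page with two tables; only the second has border="1".
page <- minimal_html('
  <table><tr><td>navigation</td></tr></table>
  <table border="1">
    <tr><th>Rank</th><th>Title</th></tr>
    <tr><td>1</td><td>Godfather, The (1972)</td></tr>
    <tr><td>2</td><td>Casablanca (1942)</td></tr>
  </table>
')

toy <- page |>
  html_element("table[border='1']") |>  # select the table by its attribute
  html_table(header = TRUE) |>
  select(rank = "Rank", title_year = "Title") |>
  separate_wider_regex(
    title_year,
    patterns = c(
      title = ".+", " +\\(",  # named pieces become columns
      year = "\\d+", "\\)"    # unnamed pieces are matched and dropped
    )
  )
toy
```

  • If several tables shared the same attribute value, html_element() would return only the first one, so the index-based approach shown earlier remains the safer choice when you are unsure.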
  • Let us compare to the book.
html.book <- read_html("scrapedpage.html")
table.book <- html.book |> 
  html_element("table") |> 
  html_table()
ratings.book <- table.book |>
  select(
    rank_title_year = `Rank & Title`,
    rating = `IMDb Rating`
  ) |> 
  mutate(
    rank_title_year = str_replace_all(rank_title_year, "\n +", " "), 
    rating_n = html.book |> html_elements("td strong") |> html_attr("title")
  ) |> 
  separate_wider_regex(
    rank_title_year,
    patterns = c(
      rank = "\\d+", "\\. ",
      title = ".+", " +\\(",
      year = "\\d+", "\\)"
    )
  ) |>
  separate_wider_regex(
    rating_n,
    patterns = c(
      "[0-9.]+ based on ",
      number = "[0-9,]+",
      " user ratings"
    )
  ) |>
  mutate(
    number = parse_number(number)
  )
ratings.book$title
  [1] "The Shawshank Redemption"                                            
  [2] "The Godfather"                                                       
  [3] "The Godfather: Part II"                                              
  [4] "The Dark Knight"                                                     
  [5] "12 Angry Men"                                                        
  [6] "Schindler's List"                                                    
  [7] "The Lord of the Rings: The Return of the King"                       
  [8] "Pulp Fiction"                                                        
  [9] "The Good, the Bad and the Ugly"                                      
 [10] "The Lord of the Rings: The Fellowship of the Ring"                   
 [11] "Fight Club"                                                          
 [12] "Forrest Gump"                                                        
 [13] "Inception"                                                           
 [14] "The Lord of the Rings: The Two Towers"                               
 [15] "Star Wars: Episode V - The Empire Strikes Back"                      
 [16] "The Matrix"                                                          
 [17] "Goodfellas"                                                          
 [18] "One Flew Over the Cuckoo's Nest"                                     
 [19] "Seven Samurai"                                                       
 [20] "Se7en"                                                               
 [21] "The Silence of the Lambs"                                            
 [22] "City of God"                                                         
 [23] "It's a Wonderful Life"                                               
 [24] "Life Is Beautiful"                                                   
 [25] "Spider-Man: No Way Home"                                             
 [26] "Saving Private Ryan"                                                 
 [27] "Star Wars: Episode IV - A New Hope"                                  
 [28] "Interstellar"                                                        
 [29] "Spirited Away"                                                       
 [30] "The Green Mile"                                                      
 [31] "Parasite"                                                            
 [32] "Léon: The Professional"                                              
 [33] "Hara-Kiri"                                                           
 [34] "The Pianist"                                                         
 [35] "Terminator 2: Judgment Day"                                          
 [36] "Back to the Future"                                                  
 [37] "The Usual Suspects"                                                  
 [38] "Psycho"                                                              
 [39] "The Lion King"                                                       
 [40] "Modern Times"                                                        
 [41] "Grave of the Fireflies"                                              
 [42] "American History X"                                                  
 [43] "Whiplash"                                                            
 [44] "Gladiator"                                                           
 [45] "City Lights"                                                         
 [46] "The Departed"                                                        
 [47] "The Intouchables"                                                    
 [48] "The Prestige"                                                        
 [49] "Casablanca"                                                          
 [50] "Once Upon a Time in the West"                                        
 [51] "Rear Window"                                                         
 [52] "Cinema Paradiso"                                                     
 [53] "Alien"                                                               
 [54] "Apocalypse Now"                                                      
 [55] "Memento"                                                             
 [56] "Indiana Jones and the Raiders of the Lost Ark"                       
 [57] "The Great Dictator"                                                  
 [58] "Django Unchained"                                                    
 [59] "The Lives of Others"                                                 
 [60] "Paths of Glory"                                                      
 [61] "Sunset Blvd."                                                        
 [62] "WALL·E"                                                              
 [63] "Avengers: Infinity War"                                              
 [64] "Witness for the Prosecution"                                         
 [65] "Spider-Man: Into the Spider-Verse"                                   
 [66] "The Shining"                                                         
 [67] "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb"
 [68] "Princess Mononoke"                                                   
 [69] "Oldboy"                                                              
 [70] "Joker"                                                               
 [71] "Your Name."                                                          
 [72] "Coco"                                                                
 [73] "The Dark Knight Rises"                                               
 [74] "Aliens"                                                              
 [75] "Once Upon a Time in America"                                         
 [76] "Avengers: Endgame"                                                   
 [77] "Capernaum"                                                           
 [78] "Das Boot"                                                            
 [79] "High and Low"                                                        
 [80] "3 Idiots"                                                            
 [81] "Toy Story"                                                           
 [82] "Amadeus"                                                             
 [83] "American Beauty"                                                     
 [84] "Braveheart"                                                          
 [85] "Inglourious Basterds"                                                
 [86] "Good Will Hunting"                                                   
 [87] "Hamilton"                                                            
 [88] "Star Wars: Episode VI - Return of the Jedi"                          
 [89] "Come and See"                                                        
 [90] "2001: A Space Odyssey"                                               
 [91] "Reservoir Dogs"                                                      
 [92] "Like Stars on Earth"                                                 
 [93] "Vertigo"                                                             
 [94] "M"                                                                   
 [95] "The Hunt"                                                            
 [96] "Citizen Kane"                                                        
 [97] "Requiem for a Dream"                                                 
 [98] "Singin' in the Rain"                                                 
 [99] "North by Northwest"                                                  
[100] "Eternal Sunshine of the Spotless Mind"                               
[101] "Ikiru"                                                               
[102] "Bicycle Thieves"                                                     
[103] "Lawrence of Arabia"                                                  
[104] "The Kid"                                                             
[105] "Full Metal Jacket"                                                   
[106] "Incendies"                                                           
[107] "The Apartment"                                                       
[108] "Dangal"                                                              
[109] "Double Indemnity"                                                    
[110] "Metropolis"                                                          
[111] "A Separation"                                                        
[112] "The Father"                                                          
[113] "Taxi Driver"                                                         
[114] "A Clockwork Orange"                                                  
[115] "The Sting"                                                           
[116] "Scarface"                                                            
[117] "Snatch"                                                              
[118] "1917"                                                                
[119] "Amélie"                                                              
[120] "To Kill a Mockingbird"                                               
[121] "Toy Story 3"                                                         
[122] "For a Few Dollars More"                                              
[123] "Up"                                                                  
[124] "Pather Panchali"                                                     
[125] "Indiana Jones and the Last Crusade"                                  
[126] "Heat"                                                                
[127] "L.A. Confidential"                                                   
[128] "Ran"                                                                 
[129] "Yojimbo"                                                             
[130] "Die Hard"                                                            
[131] "Green Book"                                                          
[132] "Rashomon"                                                            
[133] "Downfall"                                                            
[134] "All About Eve"                                                       
[135] "Monty Python and the Holy Grail"                                     
[136] "Some Like It Hot"                                                    
[137] "Batman Begins"                                                       
[138] "Unforgiven"                                                          
[139] "Children of Heaven"                                                  
[140] "Jai Bhim"                                                            
[141] "Howl's Moving Castle"                                                
[142] "The Wolf of Wall Street"                                             
[143] "Judgment at Nuremberg"                                               
[144] "There Will Be Blood"                                                 
[145] "Casino"                                                              
[146] "The Great Escape"                                                    
[147] "The Treasure of the Sierra Madre"                                    
[148] "Pan's Labyrinth"                                                     
[149] "A Beautiful Mind"                                                    
[150] "The Secret in Their Eyes"                                            
[151] "Raging Bull"                                                         
[152] "Chinatown"                                                           
[153] "My Neighbor Totoro"                                                  
[154] "Shutter Island"                                                      
[155] "Lock, Stock and Two Smoking Barrels"                                 
[156] "No Country for Old Men"                                              
[157] "Klaus"                                                               
[158] "Dial M for Murder"                                                   
[159] "The Thing"                                                           
[160] "The Gold Rush"                                                       
[161] "Three Billboards Outside Ebbing, Missouri"                           
[162] "The Seventh Seal"                                                    
[163] "The Elephant Man"                                                    
[164] "Dersu Uzala"                                                         
[165] "The Sixth Sense"                                                     
[166] "The Truman Show"                                                     
[167] "Jurassic Park"                                                       
[168] "Wild Strawberries"                                                   
[169] "The Third Man"                                                       
[170] "Memories of Murder"                                                  
[171] "V for Vendetta"                                                      
[172] "Blade Runner"                                                        
[173] "Trainspotting"                                                       
[174] "Fargo"                                                               
[175] "The Bridge on the River Kwai"                                        
[176] "Inside Out"                                                          
[177] "Finding Nemo"                                                        
[178] "Kill Bill: Vol. 1"                                                   
[179] "Warrior"                                                             
[180] "Gone with the Wind"                                                  
[181] "Tokyo Story"                                                         
[182] "On the Waterfront"                                                   
[183] "My Father and My Son"                                                
[184] "Wild Tales"                                                          
[185] "Prisoners"                                                           
[186] "Stalker"                                                             
[187] "The Grand Budapest Hotel"                                            
[188] "The Deer Hunter"                                                     
[189] "The General"                                                         
[190] "Persona"                                                             
[191] "Gran Torino"                                                         
[192] "Sherlock Jr."                                                        
[193] "Before Sunrise"                                                      
[194] "Mary and Max"                                                        
[195] "Catch Me If You Can"                                                 
[196] "Mr. Smith Goes to Washington"                                        
[197] "Barry Lyndon"                                                        
[198] "In the Name of the Father"                                           
[199] "Dune"                                                                
[200] "Hacksaw Ridge"                                                       
[201] "Z"                                                                   
[202] "Gone Girl"                                                           
[203] "Room"                                                                
[204] "The Passion of Joan of Arc"                                          
[205] "Andhadhun"                                                           
[206] "Ford v Ferrari"                                                      
[207] "12 Years a Slave"                                                    
[208] "To Be or Not to Be"                                                  
[209] "The Big Lebowski"                                                    
[210] "Dead Poets Society"                                                  
[211] "Harry Potter and the Deathly Hallows: Part 2"                        
[212] "Ben-Hur"                                                             
[213] "How to Train Your Dragon"                                            
[214] "Mad Max: Fury Road"                                                  
[215] "Autumn Sonata"                                                       
[216] "Million Dollar Baby"                                                 
[217] "The Wages of Fear"                                                   
[218] "Stand by Me"                                                         
[219] "The Handmaiden"                                                      
[220] "Network"                                                             
[221] "Logan"                                                               
[222] "A Silent Voice: The Movie"                                           
[223] "La Haine"                                                            
[224] "Hachi: A Dog's Tale"                                                 
[225] "Cool Hand Luke"                                                      
[226] "Gangs of Wasseypur"                                                  
[227] "The 400 Blows"                                                       
[228] "Platoon"                                                             
[229] "Spotlight"                                                           
[230] "Monsters, Inc."                                                      
[231] "Rebecca"                                                             
[232] "Life of Brian"                                                       
[233] "In the Mood for Love"                                                
[234] "Hotel Rwanda"                                                        
[235] "The Bandit"                                                          
[236] "Rush"                                                                
[237] "Rocky"                                                               
[238] "Amores perros"                                                       
[239] "Into the Wild"                                                       
[240] "Nausicaä of the Valley of the Wind"                                  
[241] "Demon Slayer: Mugen Train"                                           
[242] "Before Sunset"                                                       
[243] "It Happened One Night"                                               
[244] "Fanny and Alexander"                                                 
[245] "Drishyam"                                                            
[246] "The Battle of Algiers"                                               
[247] "Nights of Cabiria"                                                   
[248] "Miracle in Cell No. 7"                                               
[249] "Andrei Rublev"                                                       
[250] "The Princess Bride"                                                  
  • Let me point out features of the two datasets that make post-processing challenging.
  • I do not resolve them here; resolving them requires first deciding which questions you want answered.
ratings$title
  [1] "Godfather, The"                                                      
  [2] "Shawshank Redemption, The"                                           
  [3] "Lord of the Rings: The Return of the King, The"                      
  [4] "Godfather: Part II, The"                                             
  [5] "Schindler's List"                                                    
  [6] "Shichinin no samurai"                                                
  [7] "Casablanca"                                                          
  [8] "Lord of the Rings: The Two Towers, The"                              
  [9] "Lord of the Rings: The Fellowship of the Ring, The"                  
 [10] "Star Wars"                                                           
 [11] "Citizen Kane"                                                        
 [12] "One Flew Over the Cuckoo's Nest"                                     
 [13] "Star Wars: Episode V - The Empire Strikes Back"                      
 [14] "Rear Window"                                                         
 [15] "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb"
 [16] "Pulp Fiction"                                                        
 [17] "Raiders of the Lost Ark"                                             
 [18] "Usual Suspects, The"                                                 
 [19] "Memento"                                                             
 [20] "North by Northwest"                                                  
 [21] "Buono, il brutto, il cattivo, Il"                                    
 [22] "12 Angry Men"                                                        
 [23] "Lawrence of Arabia"                                                  
 [24] "Psycho"                                                              
 [25] "Fabuleux destin d'Amélie Poulain, Le"                                
 [26] "It's a Wonderful Life"                                               
 [27] "Silence of the Lambs, The"                                           
 [28] "Cidade de Deus"                                                      
 [29] "Goodfellas"                                                          
 [30] "American Beauty"                                                     
 [31] "Sunset Blvd."                                                        
 [32] "Vertigo"                                                             
 [33] "C'era una volta il West"                                             
 [34] "Matrix, The"                                                         
 [35] "Apocalypse Now"                                                      
 [36] "To Kill a Mockingbird"                                               
 [37] "Pianist, The"                                                        
 [38] "Third Man, The"                                                      
 [39] "Paths of Glory"                                                      
 [40] "Taxi Driver"                                                         
 [41] "Fight Club"                                                          
 [42] "Sen to Chihiro no kamikakushi"                                       
 [43] "Eternal Sunshine of the Spotless Mind"                               
 [44] "Some Like It Hot"                                                    
 [45] "Double Indemnity"                                                    
 [46] "Boot, Das"                                                           
 [47] "Singin' in the Rain"                                                 
 [48] "Chinatown"                                                           
 [49] "L.A. Confidential"                                                   
 [50] "M"                                                                   
 [51] "Maltese Falcon, The"                                                 
 [52] "Requiem for a Dream"                                                 
 [53] "Bridge on the River Kwai, The"                                       
 [54] "All About Eve"                                                       
 [55] "Se7en"                                                               
 [56] "Monty Python and the Holy Grail"                                     
 [57] "Rashômon"                                                            
 [58] "Saving Private Ryan"                                                 
 [59] "Raging Bull"                                                         
 [60] "Alien"                                                               
 [61] "Wizard of Oz, The"                                                   
 [62] "American History X"                                                  
 [63] "Léon"                                                                
 [64] "Sting, The"                                                          
 [65] "Mr. Smith Goes to Washington"                                        
 [66] "Treasure of the Sierra Madre, The"                                   
 [67] "Manchurian Candidate, The"                                           
 [68] "Vita è bella, La"                                                    
 [69] "Touch of Evil"                                                       
 [70] "Kill Bill: Vol. 1"                                                   
 [71] "Finding Nemo"                                                        
 [72] "Reservoir Dogs"                                                      
 [73] "2001: A Space Odyssey"                                               
 [74] "Great Escape, The"                                                   
 [75] "Clockwork Orange, A"                                                 
 [76] "Modern Times"                                                        
 [77] "Amadeus"                                                             
 [78] "Ran"                                                                 
 [79] "On the Waterfront"                                                   
 [80] "Annie Hall"                                                          
 [81] "Wo hu cang long"                                                     
 [82] "Jaws"                                                                
 [83] "Apartment, The"                                                      
 [84] "Braveheart"                                                          
 [85] "High Noon"                                                           
 [86] "Metropolis"                                                          
 [87] "Aliens"                                                              
 [88] "Fargo"                                                               
 [89] "Shining, The"                                                        
 [90] "Strangers on a Train"                                                
 [91] "City Lights"                                                         
 [92] "Blade Runner"                                                        
 [93] "Sixth Sense, The"                                                    
 [94] "Donnie Darko"                                                        
 [95] "General, The"                                                        
 [96] "Sjunde inseglet, Det"                                                
 [97] "Great Dictator, The"                                                 
 [98] "Duck Soup"                                                           
 [99] "Nuovo cinema Paradiso"                                               
[100] "Princess Bride, The"                                                 
[101] "Mononoke-hime"                                                       
[102] "Full Metal Jacket"                                                   
[103] "Rebecca"                                                             
[104] "Notorious"                                                           
[105] "Best Years of Our Lives, The"                                        
[106] "Yojimbo"                                                             
[107] "Big Sleep, The"                                                      
[108] "Ladri di biciclette"                                                 
[109] "Lola rennt"                                                          
[110] "Toy Story 2"                                                         
[111] "Butch Cassidy and the Sundance Kid"                                  
[112] "Patton"                                                              
[113] "Terminator 2: Judgment Day"                                          
[114] "It Happened One Night"                                               
[115] "Dogville"                                                            
[116] "Graduate, The"                                                       
[117] "Forrest Gump"                                                        
[118] "Deer Hunter, The"                                                    
[119] "Kill Bill: Vol. 2"                                                   
[120] "Glory"                                                               
[121] "Manhattan"                                                           
[122] "Mystic River"                                                        
[123] "Cool Hand Luke"                                                      
[124] "Philadelphia Story, The"                                             
[125] "Once Upon a Time in America"                                         
[126] "Searchers, The"                                                      
[127] "African Queen, The"                                                  
[128] "Unforgiven"                                                          
[129] "Ben-Hur"                                                             
[130] "Green Mile, The"                                                     
[131] "Hable con ella"                                                      
[132] "Star Wars: Episode VI - Return of the Jedi"                          
[133] "Bringing Up Baby"                                                    
[134] "Elephant Man, The"                                                   
[135] "Grapes of Wrath, The"                                                
[136] "Stalag 17"                                                           
[137] "Shrek"                                                               
[138] "Arsenic and Old Lace"                                                
[139] "Night of the Hunter, The"                                            
[140] "Gone with the Wind"                                                  
[141] "Indiana Jones and the Last Crusade"                                  
[142] "Straight Story, The"                                                 
[143] "Smultronstället"                                                     
[144] "Christmas Story, A"                                                  
[145] "Back to the Future"                                                  
[146] "Wild Bunch, The"                                                     
[147] "Platoon"                                                             
[148] "Amores perros"                                                       
[149] "All Quiet on the Western Front"                                      
[150] "Hustler, The"                                                        
[151] "Spider-Man 2"                                                        
[152] "Lost in Translation"                                                 
[153] "Young Frankenstein"                                                  
[154] "Adventures of Robin Hood, The"                                       
[155] "Gold Rush, The"                                                      
[156] "His Girl Friday"                                                     
[157] "Die Hard"                                                            
[158] "Monsters, Inc."                                                      
[159] "Bronenosets Potyomkin"                                               
[160] "Life of Brian"                                                       
[161] "Quatre cents coups, Les"                                             
[162] "Grande illusion, La"                                                 
[163] "Spartacus"                                                           
[164] "Man Who Shot Liberty Valance, The"                                   
[165] "Witness for the Prosecution"                                         
[166] "Charade"                                                             
[167] "Ying xiong"                                                          
[168] "Gladiator"                                                           
[169] "Festen"                                                              
[170] "Sling Blade"                                                         
[171] "Conversation, The"                                                   
[172] "Roman Holiday"                                                       
[173] "Toy Story"                                                           
[174] "Magnolia"                                                            
[175] "Almost Famous"                                                       
[176] "Night at the Opera, A"                                               
[177] "Hotaru no haka"                                                      
[178] "Day the Earth Stood Still, The"                                      
[179] "Trois couleurs: Rouge"                                               
[180] "Streetcar Named Desire, A"                                           
[181] "Ed Wood"                                                             
[182] "All the President's Men"                                             
[183] "To Be or Not to Be"                                                  
[184] "Insider, The"                                                        
[185] "Brazil"                                                              
[186] "Killing, The"                                                        
[187] "Pirates of the Caribbean: The Curse of the Black Pearl"              
[188] "Shadow of a Doubt"                                                   
[189] "21 Grams"                                                            
[190] "Being John Malkovich"                                                
[191] "Who's Afraid of Virginia Woolf?"                                     
[192] "Mulholland Dr."                                                      
[193] "Exorcist, The"                                                       
[194] "Harvey"                                                              
[195] "Dog Day Afternoon"                                                   
[196] "Stand by Me"                                                         
[197] "Nosferatu, eine Symphonie des Grauens"                               
[198] "Gandhi"                                                              
[199] "Big Fish"                                                            
[200] "Twelve Monkeys"                                                      
[201] "Terminator, The"                                                     
[202] "Trainspotting"                                                       
[203] "Ikiru"                                                               
[204] "Groundhog Day"                                                       
[205] "Lion in Winter, The"                                                 
[206] "This Is Spinal Tap"                                                  
[207] "Miller's Crossing"                                                   
[208] "8½"                                                                  
[209] "Right Stuff, The"                                                    
[210] "Passion de Jeanne d'Arc, La"                                         
[211] "Whale Rider"                                                         
[212] "Strada, La"                                                          
[213] "In America"                                                          
[214] "Rain Man"                                                            
[215] "Network"                                                             
[216] "Laura"                                                               
[217] "Adaptation."                                                         
[218] "Bonnie and Clyde"                                                    
[219] "39 Steps, The"                                                       
[220] "Snatch."                                                             
[221] "King Kong"                                                           
[222] "Midnight Cowboy"                                                     
[223] "Stagecoach"                                                          
[224] "Lock, Stock and Two Smoking Barrels"                                 
[225] "X2"                                                                  
[226] "Big Lebowski, The"                                                   
[227] "In the Heat of the Night"                                            
[228] "Thin Man, The"                                                       
[229] "Rio Bravo"                                                           
[230] "Untouchables, The"                                                   
[231] "Others, The"                                                         
[232] "Sunrise: A Song of Two Humans"                                       
[233] "Planet of the Apes"                                                  
[234] "Bride of Frankenstein"                                               
[235] "Kind Hearts and Coronets"                                            
[236] "Beauty and the Beast"                                                
[237] "Red River"                                                           
[238] "Die xue shuang xiong"                                                
[239] "Traffic"                                                             
[240] "Minority Report"                                                     
[241] "Sleuth"                                                              
[242] "Persona"                                                             
[243] "Enfants du paradis, Les"                                             
[244] "Being There"                                                         
[245] "Good Will Hunting"                                                   
[246] "Fantasia"                                                            
[247] "Todo sobre mi madre"                                                 
[248] "Fanny och Alexander"                                                 
[249] "Heat"                                                                
[250] "Sullivan's Travels"                                                  
left_join(ratings, ratings.book, by="title")
# A tibble: 250 × 9
   rank.x title            year.x rating.x  votes rank.y year.y rating.y  number
    <dbl> <chr>            <chr>     <dbl>  <dbl> <chr>  <chr>     <dbl>   <dbl>
 1      1 Godfather, The   1972        9    97616 <NA>   <NA>       NA        NA
 2      2 Shawshank Redem… 1994        8.9 120632 <NA>   <NA>       NA        NA
 3      3 Lord of the Rin… 2003        8.9  66781 <NA>   <NA>       NA        NA
 4      4 Godfather: Part… 1974        8.8  57761 <NA>   <NA>       NA        NA
 5      5 Schindler's List 1993        8.7  82282 6      1993        8.9 1295705
 6      6 Shichinin no sa… 1954        8.7  24675 <NA>   <NA>       NA        NA
 7      7 Casablanca       1942        8.7  55763 49     1942        8.4  552614
 8      8 Lord of the Rin… 2002        8.7  88323 <NA>   <NA>       NA        NA
 9      9 Lord of the Rin… 2001        8.7 130513 <NA>   <NA>       NA        NA
10     10 Star Wars        1977        8.7 114510 <NA>   <NA>       NA        NA
# ℹ 240 more rows
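
Most rows in the join above come back with NA in the `.y` columns. One likely reason, visible in the printed titles, is that this list files a leading article at the end ("Godfather, The"), so an exact match on `title` fails whenever the other table spells the title the usual way. The sketch below moves a trailing article back to the front before joining. The helper name `fix_trailing_article()` and the set of articles are assumptions made for illustration, and the sketch does nothing for films listed under an original-language title such as "Shichinin no samurai".

```r
# Hypothetical helper: turn "Godfather, The" into "The Godfather".
# The article list is an assumption based on the titles printed above;
# extend it if other trailing articles occur in the data.
fix_trailing_article <- function(title) {
  sub("^(.*), (The|A|An|La|Les|Det)$", "\\2 \\1", title)
}

fixed <- fix_trailing_article(
  c("Godfather, The", "Clockwork Orange, A", "Casablanca")
)
fixed
# "The Godfather" "A Clockwork Orange" "Casablanca"
```

Applying the helper to `ratings$title` inside `mutate()` before the `left_join()` should raise the number of matched rows, provided `ratings.book` really does store titles with the article first.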

Example: Quarantine Zine Club

  • The task is to build a table with each zine's title, its author, the link to download the zine, and the author's first social media link.
  • This example uses XPath selectors, which the book does not cover.
url <- "https://quarantinezineclub.neocities.org/zinelibrary"
html <- read_html(url)
# One row per zine; mark is the position of each zine's <div> in the absolute XPath below
zinelib <- tibble(
  authors = html |> html_elements("#zname") |> html_text2(),
  titles = html |> html_elements("#libraryheader") |> html_text2(),
  pdflink = html |>
    html_elements(xpath = "//div[@class='column right']") |>
    html_element("a") |>
    html_attr("href"),
  mark = 2:156
)
# Return the href of the first link in the second <div> of row num (NA if absent)
get.first.social <- function(num) {
  xpath <- paste0("/html/body/div[2]/div[", num, "]/div[2]/a[1]")
  html |> html_element(xpath = xpath) |> html_attr("href")
}
zinelib <- zinelib |>
  mutate(
    pdflink = paste0("https://quarantinezineclub.neocities.org/", pdflink),
    # Build and evaluate one XPath per row
    social1 = map_chr(mark, get.first.social)
  ) |>
  select(-mark)
zinelib
# A tibble: 155 × 4
   authors         titles                                     pdflink    social1
   <chr>           <chr>                                      <chr>      <chr>  
 1 Billyszine      Spooks - Hallozeen 2020                    https://q… https:…
 2 Miles Davitt    Quarantine Comix                           https://q… https:…
 3 Riley Gunderson Sanity                                     https://q… https:…
 4 Nhu Duong       Some Days (feel like a fight)              https://q… https:…
 5 Jesse Dekel     Munted (Colour)                            https://q… https:…
 6 Jesse Dekel     Munted (B&W)                               https://q… https:…
 7 Jaye Sosa       Homeroom                                   https://q… https:…
 8 Jaye Sosa       A Small Spicy Guide to Mexican Hot Peppers https://q… https:…
 9 Kate Dunn       FEAR OF FLYING                             https://q… https:…
10 Luke You        YOU (03.10.20)                             https://q… https:…
# ℹ 145 more rows
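
Since the book does not cover XPath, a small offline comparison may help. The sketch below builds a made-up two-row page with `minimal_html()` (the markup, file names, and `example.com` link are invented and only loosely imitate a two-column listing) and extracts the same links with an XPath expression and with a CSS selector. It also shows that `html_element()` with a relative XPath returns one result per row, with NA where a row has no matching link.

```r
library(rvest)

# Hypothetical miniature page; not the real Quarantine Zine Club markup.
html <- minimal_html('
  <div class="row">
    <div class="column left"><p>Author A</p></div>
    <div class="column right">
      <a href="zines/a.pdf">PDF</a>
      <a href="https://example.com/a">Social</a>
    </div>
  </div>
  <div class="row">
    <div class="column left"><p>Author B</p></div>
    <div class="column right">
      <a href="zines/b.pdf">PDF</a>
    </div>
  </div>
')

# XPath: match <div> elements whose class attribute is exactly "column right".
via_xpath <- html |>
  html_elements(xpath = "//div[@class='column right']") |>
  html_element("a") |>
  html_attr("href")

# CSS: match <div> elements that carry both classes, in any order.
via_css <- html |>
  html_elements("div.column.right") |>
  html_element("a") |>
  html_attr("href")

# Relative XPath evaluated once per row: second link in the second child <div>.
second_link <- html |>
  html_elements("div.row") |>
  html_element(xpath = "./div[2]/a[2]") |>
  html_attr("href")

via_xpath    # "zines/a.pdf" "zines/b.pdf"
second_link  # "https://example.com/a" NA
```

The XPath predicate `@class='column right'` compares the whole attribute string, so it would miss an element written as `class="right column"`, whereas the CSS form matches either order. Selecting rows first and then using a relative path keeps the output aligned with the rows even when some links are missing.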