In this post, I will build a Letterboxd scraper. In case you didn’t know, Letterboxd is a film logging site with built-in social networking features. It is a great platform, but the company does not provide a public API for its data, so one has to use a scraper to get their hands on the film/user data.
For this scraper, given a username, we want all the films the user has logged. Then, we will want to collect the
- runtime
- director(s)
- country
- languages
- genre(s)
- year
- user’s rating
of each film.
We will also utilize Python’s concurrent.futures module to significantly reduce the time it takes to scrape our data by overlapping our requests. Without it, scraping is frustratingly slow even for users with fewer than 200 entries in their logs. For the HTML parsing, we will use the Python package BeautifulSoup.
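To make the payoff concrete before we start, here is a minimal, self-contained sketch of the idea (the repeated URL and the timing printout are purely illustrative, not part of the scraper):

```python
# Minimal sketch of why threads help I/O-bound scraping: the requests below
# overlap their network waits instead of queuing up. Illustrative only.
import time
import requests
import concurrent.futures

urls = ["https://letterboxd.com/film/knives-out-2019/"] * 5

start = time.perf_counter()
_ = [requests.get(u).text for u in urls]  # one request at a time
print(f"sequential: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor() as executor:
    _ = list(executor.map(lambda u: requests.get(u).text, urls))
print(f"threaded:   {time.perf_counter() - start:.2f}s")
```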
You can find the Jupyter notebook for this post here and the standalone script that exports a CSV here.
```python
import re
import concurrent.futures

import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
```
```python
def getNumPages(username):
    """Return the number of pages in a user's /films log."""
    baseurl = 'https://letterboxd.com/{}/films'.format(username)
    r = requests.get(baseurl)
    sp = BeautifulSoup(r.text, 'html.parser')
    try:
        pages = int(sp.select("li.paginate-page")[-1].text)
    except IndexError:
        pages = 1  # users whose logged films fit on a single page have no paginator
    return pages
```
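As a quick sanity check (this hits the live site, so the count will drift as the user logs more films):

```python
getNumPages('indiewire')  # -> 3 at the time of writing; their 210-film log spans 3 pages
```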
Given a link to a wall of films (such as the paginated /films/ pages of a user), the next function collects the links of all the films on that page.
```python
def the_filmlinks(url):
    """Collect the film links from one page of a user's film log."""
    r = requests.get(url)
    sp = BeautifulSoup(r.text, 'html.parser')
    lbxbaseurl = "https://letterboxd.com"  # data-target-link already starts with a slash
    # each poster div carries its film's relative URL in data-target-link
    return [
        lbxbaseurl + poster.get("data-target-link")
        for poster in sp.select(".really-lazy-load")
    ]
```
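For reference, this is roughly the markup the .really-lazy-load selector is matching. The snippet below is a simplified, illustrative stand-in; the real poster markup carries more attributes:

```python
from bs4 import BeautifulSoup

# simplified stand-in for one poster in the film grid
sample_html = '''
<li class="poster-container">
  <div class="really-lazy-load poster" data-film-id="475370"
       data-target-link="/film/knives-out-2019/"></div>
</li>
'''
sp = BeautifulSoup(sample_html, 'html.parser')
print([d.get("data-target-link") for d in sp.select(".really-lazy-load")])
# ['/film/knives-out-2019/']
```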
The next function collects the film links from every page of the user's log:
```python
def getAllLinks(username):
    """Collect the film links from every page of a user's film log."""
    pages = getNumPages(username)
    baseurl = "https://letterboxd.com/{}/films/page/".format(username)
    # looping over the pages and calling the_filmlinks one page at a time
    # works, but it is slower; a ThreadPoolExecutor lets the requests overlap
    pagelinks = [baseurl + str(i) for i in range(1, pages + 1)]
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(the_filmlinks, url=pagelink) for pagelink in pagelinks]
        # as_completed yields futures in completion order, so the links are
        # not guaranteed to come back in log order
        links = [future.result() for future in concurrent.futures.as_completed(futures)]
    return [link for page_links in links for link in page_links]
```
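A quick check against the same user we will scrape below (again, live counts will drift over time):

```python
links = getAllLinks('indiewire')
len(links)  # -> 210 at the time of writing
```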
The function below is largely independent of the rest: it is just regex and BeautifulSoup selector code that pulls the details out of a single film page.
```python
def the_details(url):
    """Scrape the non-user-specific details from a single film page."""
    r = requests.get(url)
    sp = BeautifulSoup(r.text, 'html.parser')
    # the average rating lives in a meta description; pages without enough
    # ratings fall back to a description that starts with 'Letterboxd'
    ratingblob = sp.select("head > meta:nth-child(20)")[0]
    if ratingblob.get("content").split()[0] == 'Letterboxd':
        rating_c = np.nan
    else:
        rating_c = float(ratingblob.get("content").split()[0])
    directors = [name.text for name in sp.select("span.prettify")]
    # grab the TMDB link; entries that aren't movies (e.g. TV shows) do not
    # link to /movie/, so we give them id 0
    tmdbblob = sp.find('a', attrs={'data-track-action': 'TMDb'})
    res = re.search(r'/movie/(\d+)/', tmdbblob.get("href")) if tmdbblob else None
    if res:
        film_id = sp.find(class_="really-lazy-load").get("data-film-id")
    else:
        film_id = 0
    ### Stubs
    genrestub = sp.select('a[href^="/films/genre/"]')
    try:
        countrystub = sp.select('a[href^="/films/country/"]')[0]  # e.g. /films/country/usa/
        country = re.search(r"/country/(\w+)/", countrystub.get("href")).group(1)
    except IndexError:
        country = 0
    try:
        languagestub = sp.select('a[href^="/films/language/"]')
        # use a set: the original language also appears among the spoken languages
        langs = {stub.text for stub in languagestub}
    except Exception:
        langs = 0
    film = {
        'film_id': int(film_id),  # will be used to exclude TV shows
        'film_title': sp.select_one("h1.headline-1").text,
        'film_year': int(sp.select_one("small.number").text),
        'director': directors,
        'average_rating': rating_c,
        'runtime': int(re.search(r'\d+', sp.select_one("p.text-link").text).group()),
        'country': country,
        'genres': [stub.text for stub in genrestub],
        'languages': langs
    }
    return film
```
Let us test it out with this film:
```python
the_details('https://letterboxd.com/film/knives-out-2019/')
```

```python
{'film_id': 475370,
 'film_title': 'Knives Out',
 'film_year': 2019,
 'director': ['Rian Johnson'],
 'average_rating': 4.01,
 'runtime': 131,
 'country': 'usa',
 'genres': ['Mystery', 'Comedy', 'Crime'],
 'languages': {'English', 'Spanish'}}
```
Great!
Now, we collect the film details for all of the films a given user has logged. This excludes the user’s ratings, which will be scraped separately (see below). The function is again parallelized with ThreadPoolExecutor. This is important because, without it, most of the scraping time is wasted downtime spent waiting for responses between requests.
```python
def getLoggedFilmDetails(username):
    """Scrape the non-user-specific details of every film the user has logged."""
    urls = getAllLinks(username)
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(the_details, url) for url in urls]
        results = [future.result()
                   for future in concurrent.futures.as_completed(futures)]
    return results
```
```python
pd.DataFrame(getLoggedFilmDetails('indiewire'))
```
|     | film_id | film_title | film_year | director | average_rating | runtime | country | genres | languages |
|-----|---------|------------|-----------|----------|----------------|---------|---------|--------|-----------|
| 0 | 228594 | Blonde | 2022 | [Andrew Dominik] | 2.04 | 167 | usa | [Drama] | {Italian, English} |
| 1 | 801082 | A Man Named Scott | 2021 | [Robert Alexander] | 3.83 | 95 | usa | [Music, Documentary] | {English} |
| 2 | 560787 | Spider-Man: No Way Home | 2021 | [Jon Watts] | 3.86 | 148 | usa | [Action, Adventure, Science Fiction] | {Tagalog, English} |
| 3 | 519052 | The Tragedy of Macbeth | 2021 | [Joel Coen] | 3.77 | 105 | usa | [War, Drama] | {English} |
| 4 | 565654 | The Addams Family 2 | 2021 | [Conrad Vernon, Greg Tiernan] | 2.28 | 93 | canada | [Animation, Fantasy, Horror, Family, Comedy, A... | {Spanish, Latin, Ukrainian, English} |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 205 | 457180 | Raya and the Last Dragon | 2021 | [Don Hall, Carlos López Estrada] | 3.30 | 107 | usa | [Family, Action, Animation, Fantasy, Adventure] | {English} |
| 206 | 713525 | I'm Your Man | 2021 | [Maria Schrader] | 3.55 | 108 | germany | [Comedy, Science Fiction, Romance] | {German, Korean, Spanish, French, English} |
| 207 | 508523 | Crisis | 2021 | [Nicholas Jarecki] | 2.73 | 118 | belgium | [Thriller, Drama, Crime] | {German, English} |
| 208 | 473304 | Cherry | 2021 | [Anthony Russo, Joe Russo] | 2.81 | 140 | usa | [Crime, Drama] | {English} |
| 209 | 579476 | Ninjababy | 2021 | [Yngvild Sve Flikke] | 3.81 | 104 | norway | [Comedy, Drama] | {Norwegian} |

210 rows × 9 columns
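As the film_id comment above hinted, entries without a TMDB /movie/ link were given id 0, so dropping them is one way to keep only actual films. This filter is a sketch, not part of the scraper itself:

```python
details = pd.DataFrame(getLoggedFilmDetails('indiewire'))
movies_only = details[details['film_id'] != 0]  # drop TV shows and other non-movie entries
```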
Now, we collect the user's own ratings:
```python
def getRatings(username):
    """Scrape the user's own rating for every film in their log."""
    pages = getNumPages(username)
    baseurl = "https://letterboxd.com/{}/films/page/".format(username)
    rateid = []
    # map the star glyphs Letterboxd renders to numeric ratings
    stars = {
        "★": 1, "★★": 2, "★★★": 3, "★★★★": 4, "★★★★★": 5, "½": 0.5, "★½": 1.5,
        "★★½": 2.5, "★★★½": 3.5, "★★★★½": 4.5
    }
    for page in range(1, pages + 1):
        film_p = baseurl + str(page)
        soup_p = BeautifulSoup(requests.get(film_p).text, 'html.parser')
        for thing in soup_p.find_all('li', class_="poster-container"):
            try:
                userrating = stars[thing.find(class_="rating").get_text().strip()]
            except AttributeError:  # unrated films have no rating element
                userrating = np.nan
            filmp = {
                'film_id': int(thing.find(class_="really-lazy-load").get("data-film-id")),
                'user_rating': userrating
            }
            rateid.append(filmp)
    return rateid
```
Try it out with a user:
```python
pd.DataFrame(getRatings('indiewire'))
```
|     | film_id | user_rating |
|-----|---------|-------------|
| 0 | 228594 | 2.5 |
| 1 | 905069 | 3.0 |
| 2 | 666269 | 3.5 |
| 3 | 385511 | 3.0 |
| 4 | 777185 | 4.5 |
| ... | ... | ... |
| 205 | 399633 | 1.5 |
| 206 | 468597 | 1.5 |
| 207 | 448164 | 3.0 |
| 208 | 381286 | 3.0 |
| 209 | 11370 | NaN |

210 rows × 2 columns
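To put the two pieces together, a left merge on film_id does the trick. This final step is a sketch rather than part of the scraper above:

```python
details = pd.DataFrame(getLoggedFilmDetails('indiewire'))
ratings = pd.DataFrame(getRatings('indiewire'))
logged = details.merge(ratings, on='film_id', how='left')
logged[['film_title', 'average_rating', 'user_rating']].head()
```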
This will be useful for the recommendation engine part of my project, which is part of the reason I decided to write a scraper in the first place. Other than that, I thought it would be a great way to hone my Python skills and revive this blog :)
I hope you found this write-up/code useful!