In this post, I will build a Letterboxd scraper. In case you didn’t know, Letterboxd is a film logging site with built-in social networking features. It is a great platform, but the company does not provide a public API for its data, so one has to use a scraper to get their hands on the film/user data.
For this scraper, given a username, we want all the films the user has logged. Then, we will want to collect the
- runtime
- director(s)
- country
- languages
- genre(s)
- year
- user’s rating
of each film.
We will also utilize Python’s concurrent.futures module to significantly reduce the time it takes to scrape our data by overlapping our requests. Without it, scraping is frustratingly slow even for users with fewer than 200 entries in their logs. For the HTML parsing, we will use the Python package BeautifulSoup.
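To make the payoff concrete before we start, here is a minimal, self-contained sketch of the idea (the repeated URL and the timing printout are purely illustrative, not part of the scraper):

```python
# Minimal sketch of why threads help I/O-bound scraping: the requests below
# overlap their network waits instead of queuing up. Illustrative only.
import time
import requests
import concurrent.futures

urls = ["https://letterboxd.com/film/knives-out-2019/"] * 5

start = time.perf_counter()
_ = [requests.get(u).text for u in urls]  # one request at a time
print(f"sequential: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor() as executor:
    _ = list(executor.map(lambda u: requests.get(u).text, urls))
print(f"threaded:   {time.perf_counter() - start:.2f}s")
```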
You can find the Jupyter notebook for this post here and the standalone script that exports a CSV here.
```python
import re
import concurrent.futures

import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
```
```python
def getNumPages(username):
    """Return the number of pages in a user's /films log."""
    baseurl = 'https://letterboxd.com/{}/films'.format(username)
    r = requests.get(baseurl)
    sp = BeautifulSoup(r.text, 'html.parser')
    try:
        pages = int(sp.select("li.paginate-page")[-1].text)
    except IndexError:
        pages = 1  # users whose logged films fit on a single page have no paginator
    return pages
```
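As a quick sanity check (this hits the live site, so the count will drift as the user logs more films):

```python
getNumPages('indiewire')  # -> 3 at the time of writing; their 210-film log spans 3 pages
```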
Given a link to a wall of films (such as the paginated /films/ pages of a user), the next function collects the links of all the films on that page.
```python
def the_filmlinks(url):
    """Collect the film links from one page of a user's film log."""
    r = requests.get(url)
    sp = BeautifulSoup(r.text, 'html.parser')
    lbxbaseurl = "https://letterboxd.com"  # data-target-link already starts with a slash
    # each poster div carries its film's relative URL in data-target-link
    return [
        lbxbaseurl + poster.get("data-target-link")
        for poster in sp.select(".really-lazy-load")
    ]
```
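For reference, this is roughly the markup the .really-lazy-load selector is matching. The snippet below is a simplified, illustrative stand-in; the real poster markup carries more attributes:

```python
from bs4 import BeautifulSoup

# simplified stand-in for one poster in the film grid
sample_html = '''
<li class="poster-container">
  <div class="really-lazy-load poster" data-film-id="475370"
       data-target-link="/film/knives-out-2019/"></div>
</li>
'''
sp = BeautifulSoup(sample_html, 'html.parser')
print([d.get("data-target-link") for d in sp.select(".really-lazy-load")])
# ['/film/knives-out-2019/']
```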
The next function collects the film links from every page of the user's log:
```python
def getAllLinks(username):
    """Collect the film links from every page of a user's film log."""
    pages = getNumPages(username)
    baseurl = "https://letterboxd.com/{}/films/page/".format(username)
    # looping over the pages and calling the_filmlinks one page at a time
    # works, but it is slower; a ThreadPoolExecutor lets the requests overlap
    pagelinks = [baseurl + str(i) for i in range(1, pages + 1)]
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(the_filmlinks, url=pagelink) for pagelink in pagelinks]
        # as_completed yields futures in completion order, so the links are
        # not guaranteed to come back in log order
        links = [future.result() for future in concurrent.futures.as_completed(futures)]
    return [link for page_links in links for link in page_links]
```
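A quick check against the same user we will scrape below (again, live counts will drift over time):

```python
links = getAllLinks('indiewire')
len(links)  # -> 210 at the time of writing
```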
The function below is largely independent of the rest: it is just regex and BeautifulSoup selector code that pulls the details out of a single film page.
```python
def the_details(url):
    """Scrape the non-user-specific details from a single film page."""
    r = requests.get(url)
    sp = BeautifulSoup(r.text, 'html.parser')
    # the average rating lives in a meta description; pages without enough
    # ratings fall back to a description that starts with 'Letterboxd'
    ratingblob = sp.select("head > meta:nth-child(20)")[0]
    if ratingblob.get("content").split()[0] == 'Letterboxd':
        rating_c = np.nan
    else:
        rating_c = float(ratingblob.get("content").split()[0])
    directors = [name.text for name in sp.select("span.prettify")]
    # grab the TMDB link; entries that aren't movies (e.g. TV shows) do not
    # link to /movie/, so we give them id 0
    tmdbblob = sp.find('a', attrs={'data-track-action': 'TMDb'})
    res = re.search(r'/movie/(\d+)/', tmdbblob.get("href")) if tmdbblob else None
    if res:
        film_id = sp.find(class_="really-lazy-load").get("data-film-id")
    else:
        film_id = 0
    ### Stubs
    genrestub = sp.select('a[href^="/films/genre/"]')
    try:
        countrystub = sp.select('a[href^="/films/country/"]')[0]  # e.g. /films/country/usa/
        country = re.search(r"/country/(\w+)/", countrystub.get("href")).group(1)
    except IndexError:
        country = 0
    try:
        languagestub = sp.select('a[href^="/films/language/"]')
        # use a set: the original language also appears among the spoken languages
        langs = {stub.text for stub in languagestub}
    except Exception:
        langs = 0
    film = {
        'film_id': int(film_id),  # will be used to exclude TV shows
        'film_title': sp.select_one("h1.headline-1").text,
        'film_year': int(sp.select_one("small.number").text),
        'director': directors,
        'average_rating': rating_c,
        'runtime': int(re.search(r'\d+', sp.select_one("p.text-link").text).group()),
        'country': country,
        'genres': [stub.text for stub in genrestub],
        'languages': langs
    }
    return film
```
Let us test it out with this film:
```python
the_details('https://letterboxd.com/film/knives-out-2019/')
```

```python
{'film_id': 475370,
 'film_title': 'Knives Out',
 'film_year': 2019,
 'director': ['Rian Johnson'],
 'average_rating': 4.01,
 'runtime': 131,
 'country': 'usa',
 'genres': ['Mystery', 'Comedy', 'Crime'],
 'languages': {'English', 'Spanish'}}
```
Great!
Now, we collect the film details for all of the films a given user has logged. This excludes the user’s ratings, which will be scraped separately (see below). The function is again parallelized with ThreadPoolExecutor. This is important because, without it, most of the scraping time is wasted downtime spent waiting for responses between requests.
```python
def getLoggedFilmDetails(username):
    """Scrape the non-user-specific details of every film the user has logged."""
    urls = getAllLinks(username)
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(the_details, url) for url in urls]
        results = [future.result()
                   for future in concurrent.futures.as_completed(futures)]
    return results
```
```python
pd.DataFrame(getLoggedFilmDetails('indiewire'))
```
|     | film_id | film_title | film_year | director | average_rating | runtime | country | genres | languages |
|-----|---------|------------|-----------|----------|----------------|---------|---------|--------|-----------|
| 0 | 228594 | Blonde | 2022 | [Andrew Dominik] | 2.04 | 167 | usa | [Drama] | {Italian, English} |
| 1 | 801082 | A Man Named Scott | 2021 | [Robert Alexander] | 3.83 | 95 | usa | [Music, Documentary] | {English} |
| 2 | 560787 | Spider-Man: No Way Home | 2021 | [Jon Watts] | 3.86 | 148 | usa | [Action, Adventure, Science Fiction] | {Tagalog, English} |
| 3 | 519052 | The Tragedy of Macbeth | 2021 | [Joel Coen] | 3.77 | 105 | usa | [War, Drama] | {English} |
| 4 | 565654 | The Addams Family 2 | 2021 | [Conrad Vernon, Greg Tiernan] | 2.28 | 93 | canada | [Animation, Fantasy, Horror, Family, Comedy, A... | {Spanish, Latin, Ukrainian, English} |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 205 | 457180 | Raya and the Last Dragon | 2021 | [Don Hall, Carlos López Estrada] | 3.30 | 107 | usa | [Family, Action, Animation, Fantasy, Adventure] | {English} |
| 206 | 713525 | I'm Your Man | 2021 | [Maria Schrader] | 3.55 | 108 | germany | [Comedy, Science Fiction, Romance] | {German, Korean, Spanish, French, English} |
| 207 | 508523 | Crisis | 2021 | [Nicholas Jarecki] | 2.73 | 118 | belgium | [Thriller, Drama, Crime] | {German, English} |
| 208 | 473304 | Cherry | 2021 | [Anthony Russo, Joe Russo] | 2.81 | 140 | usa | [Crime, Drama] | {English} |
| 209 | 579476 | Ninjababy | 2021 | [Yngvild Sve Flikke] | 3.81 | 104 | norway | [Comedy, Drama] | {Norwegian} |

210 rows × 9 columns
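As the film_id comment above hinted, entries without a TMDB /movie/ link were given id 0, so dropping them is one way to keep only actual films. This filter is a sketch, not part of the scraper itself:

```python
details = pd.DataFrame(getLoggedFilmDetails('indiewire'))
movies_only = details[details['film_id'] != 0]  # drop TV shows and other non-movie entries
```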
Now, we collect the user's own ratings:
```python
def getRatings(username):
    """Scrape the user's own rating for every film in their log."""
    pages = getNumPages(username)
    baseurl = "https://letterboxd.com/{}/films/page/".format(username)
    rateid = []
    # map the star glyphs Letterboxd renders to numeric ratings
    stars = {
        "★": 1, "★★": 2, "★★★": 3, "★★★★": 4, "★★★★★": 5, "½": 0.5, "★½": 1.5,
        "★★½": 2.5, "★★★½": 3.5, "★★★★½": 4.5
    }
    for page in range(1, pages + 1):
        film_p = baseurl + str(page)
        soup_p = BeautifulSoup(requests.get(film_p).text, 'html.parser')
        for thing in soup_p.find_all('li', class_="poster-container"):
            try:
                userrating = stars[thing.find(class_="rating").get_text().strip()]
            except AttributeError:  # unrated films have no rating element
                userrating = np.nan
            filmp = {
                'film_id': int(thing.find(class_="really-lazy-load").get("data-film-id")),
                'user_rating': userrating
            }
            rateid.append(filmp)
    return rateid
```
Try it out with a user:
```python
pd.DataFrame(getRatings('indiewire'))
```
|     | film_id | user_rating |
|-----|---------|-------------|
| 0 | 228594 | 2.5 |
| 1 | 905069 | 3.0 |
| 2 | 666269 | 3.5 |
| 3 | 385511 | 3.0 |
| 4 | 777185 | 4.5 |
| ... | ... | ... |
| 205 | 399633 | 1.5 |
| 206 | 468597 | 1.5 |
| 207 | 448164 | 3.0 |
| 208 | 381286 | 3.0 |
| 209 | 11370 | NaN |

210 rows × 2 columns
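To put the two pieces together, a left merge on film_id does the trick. This final step is a sketch rather than part of the scraper above:

```python
details = pd.DataFrame(getLoggedFilmDetails('indiewire'))
ratings = pd.DataFrame(getRatings('indiewire'))
logged = details.merge(ratings, on='film_id', how='left')
logged[['film_title', 'average_rating', 'user_rating']].head()
```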
This will be useful for the recommendation engine part of my project, which is part of the reason I decided to write a scraper in the first place. Other than that, I thought it would be a great way to hone my Python skills and revive this blog :)
I hope you found this write-up/code useful!