In this post, I will build a Letterboxd scraper. In case you didn’t know, Letterboxd is a film logging site with built-in social networking features. It is a great platform, but the company does not provide a public API for its data, so one has to use a scraper to get their hands on the film/user data.

For this scraper, given a username, we want all the films the user has logged. Then, we will want to collect the

  • runtime
  • director(s)
  • country
  • languages
  • genre(s)
  • year
  • user’s rating

of each film.

We will also use Python’s concurrent.futures module to significantly reduce the time it takes to scrape our data by overlapping our requests. Without it, scraping is frustratingly slow even for users with fewer than 200 entries in their logs. For the HTML parsing and scraping, we will use the Python package BeautifulSoup.
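To see why this matters, here is a minimal sketch of the pattern we will use throughout (the list of URLs is purely illustrative): each request spends most of its time waiting on the network, so a thread pool lets several requests wait at the same time instead of one after another.

import requests
import concurrent.futures

# Illustrative list of pages to fetch; in the scraper these will be
# Letterboxd pages collected from a user's log.
urls = ["https://letterboxd.com/film/knives-out-2019/"] * 5

# Sequential: each request waits for the previous one to finish.
pages_sequential = [requests.get(u).text for u in urls]

# Threaded: several requests wait on the network at the same time.
with concurrent.futures.ThreadPoolExecutor() as executor:
    pages_threaded = list(executor.map(lambda u: requests.get(u).text, urls))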

You can find the Jupyter notebook for this post here, and the standalone script that exports a CSV here.

import pandas as pd
import re
import requests
from bs4 import BeautifulSoup
import numpy as np

import concurrent.futures

def getNumPages(username):
    baseurl = 'https://letterboxd.com/{}/films'.format(username)
    r = requests.get(baseurl)
    sp = BeautifulSoup(r.text, 'html.parser')
    try:
        pages = int(sp.select("li.paginate-page")[-1].text)
    except IndexError:
        pages = 1  # users whose logged films fit on a single page have no pagination links
    return pages
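A quick sanity check with the user we will scrape later; the value simply reflects how many pages of posters their log currently spans.

getNumPages('indiewire')  # number of /films/ pages in this user's log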

Given a link that contains a wall of films (such as one of the paginated /films/ pages of a user), this function collects the links to all the films on that page.

def the_filmlinks(url):
    r = requests.get(url)
    sp = BeautifulSoup(r.text, 'html.parser')
    lbxbaseurl = "https://letterboxd.com"  # data-target-link already starts with a slash
    return [
        lbxbaseurl + poster.get("data-target-link") for poster in sp.select(".really-lazy-load")
    ]
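For example, grabbing the film links from the first page of a user’s log:

the_filmlinks('https://letterboxd.com/indiewire/films/page/1')
# -> ['https://letterboxd.com/film/...', ...] one URL per poster on the page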

This function collects the film links from every page of a user’s log:

def getAllLinks(username):
    pages = getNumPages(username)
    baseurl = "https://letterboxd.com/{}/films/page/".format(username)
    # Looping over the pages one by one works but is slower,
    # so we fetch them concurrently with ThreadPoolExecutor instead.
    pagelinks = [baseurl + str(i) for i in range(1, pages + 1)]
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(the_filmlinks, url=pagelink) for pagelink in pagelinks]
        links = [future.result() for future in concurrent.futures.as_completed(futures)]

    # Each future returns a list of links, so flatten the list of lists.
    return [link for pagelist in links for link in pagelist]
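Calling this with a username returns one flat list of film URLs, which is exactly what we will feed into the detail scraper next:

getAllLinks('indiewire')[:5]  # the first few film URLs from this user's log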

The function below is mostly independent of the rest: it is just regex and BeautifulSoup selector code that pulls the details we want out of a single film page.

def the_details(url): # independent of the rest; lots of BeautifulSoup selector code went into this
    r = requests.get(url)
    sp = BeautifulSoup(r.text, 'html.parser')

    # The average rating lives in a meta tag; when a film has no rating yet,
    # the tag's content starts with 'Letterboxd' instead of a number.
    ratingblob = sp.select("head > meta:nth-child(20)")[0]
    if ratingblob.get("content").split()[0] == 'Letterboxd':
        rating_c = np.nan
    else:
        rating_c = float(ratingblob.get("content").split()[0])

    directors = [name.text for name in sp.select("span.prettify")]

    # Entries that are not movies (e.g. TV shows) have a TMDb link that does not
    # contain /movie/, so we give them id 0 and use that to exclude them later.
    tmdbblob = sp.find('a', attrs={'data-track-action': 'TMDb'})
    res = re.search(r'/movie/(\d+)/', tmdbblob.get("href"))
    if res:
        film_id = sp.find(class_="really-lazy-load").get("data-film-id")
    else:
        film_id = 0

    ### Stubs
    genrestub = sp.select('a[href^="/films/genre/"]')

    try:
        countrystub = sp.select('a[href^="/films/country/"]')[0]  # e.g. /films/country/usa/
        country = re.search(r"/country/(\w+)/", countrystub.get("href")).group(1)
    except (IndexError, AttributeError):
        country = 0

    languagestub = sp.select('a[href^="/films/language/"]')
    # Use a set because the original language and the spoken languages can repeat.
    langs = {stub.text for stub in languagestub} if languagestub else 0

    film = {
        'film_id': int(film_id),  # will be used to exclude tv shows
        'film_title': sp.select_one("h1.headline-1").text,
        'film_year': int(sp.select_one("small.number").text),
        'director': directors,
        'average_rating': rating_c,
        'runtime': int(re.search(r'\d+', sp.select_one("p.text-link").text).group()),
        'country': country,
        'genres': [stub.text for stub in genrestub],
        'languages': langs
        #'actors': []
    }
    return film

Let us test it out with this film:

the_details('https://letterboxd.com/film/knives-out-2019/')
{'film_id': 475370,
 'film_title': 'Knives Out',
 'film_year': 2019,
 'director': ['Rian Johnson'],
 'average_rating': 4.01,
 'runtime': 131,
 'country': 'usa',
 'genres': ['Mystery', 'Comedy', 'Crime'],
 'languages': {'English', 'Spanish'}}

Great!


Now, we collect the film details for all of the films a given user has logged. This excludes the user’s own rating, which will be scraped separately below. The function is also sped up with ThreadPoolExecutor, which matters because otherwise most of the time is wasted waiting for each request to come back before the next one is sent.

def getLoggedFilmDetails(username): # non-user-related details
    urls = getAllLinks(username)
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(the_details, url) for url in urls]
        results = [future.result()
                   for future in concurrent.futures.as_completed(futures)]
    return results
pd.DataFrame(getLoggedFilmDetails('indiewire'))
     film_id  film_title                film_year  director                          average_rating  runtime  country  genres                                             languages
0     228594  Blonde                         2022  [Andrew Dominik]                            2.04      167  usa      [Drama]                                            {Italian, English}
1     801082  A Man Named Scott              2021  [Robert Alexander]                          3.83       95  usa      [Music, Documentary]                               {English}
2     560787  Spider-Man: No Way Home        2021  [Jon Watts]                                 3.86      148  usa      [Action, Adventure, Science Fiction]               {Tagalog, English}
3     519052  The Tragedy of Macbeth         2021  [Joel Coen]                                 3.77      105  usa      [War, Drama]                                       {English}
4     565654  The Addams Family 2            2021  [Conrad Vernon, Greg Tiernan]               2.28       93  canada   [Animation, Fantasy, Horror, Family, Comedy, A...  {Spanish, Latin, Ukrainian, English}
..       ...  ...                             ...  ...                                          ...      ...  ...      ...                                                ...
205   457180  Raya and the Last Dragon       2021  [Don Hall, Carlos López Estrada]            3.30      107  usa      [Family, Action, Animation, Fantasy, Adventure]    {English}
206   713525  I'm Your Man                   2021  [Maria Schrader]                            3.55      108  germany  [Comedy, Science Fiction, Romance]                 {German, Korean, Spanish, French, English}
207   508523  Crisis                         2021  [Nicholas Jarecki]                          2.73      118  belgium  [Thriller, Drama, Crime]                           {German, English}
208   473304  Cherry                         2021  [Anthony Russo, Joe Russo]                  2.81      140  usa      [Crime, Drama]                                     {English}
209   579476  Ninjababy                      2021  [Yngvild Sve Flikke]                        3.81      104  norway   [Comedy, Drama]                                    {Norwegian}

210 rows × 9 columns


Now, we collect the user’s ratings:

def getRatings(username):
    pages = getNumPages(username)
    baseurl = "https://letterboxd.com/{}/films/page/".format(username)
    rateid = []
    stars = {
        "½": 0.5, "★": 1, "★½": 1.5, "★★": 2, "★★½": 2.5,
        "★★★": 3, "★★★½": 3.5, "★★★★": 4, "★★★★½": 4.5, "★★★★★": 5
    }

    for page in range(1, pages + 1):
        film_p = baseurl + str(page)
        soup_p = BeautifulSoup(requests.get(film_p).text, 'html.parser')
        for poster in soup_p.find_all('li', class_="poster-container"):
            try:
                userrating = stars[poster.find(class_="rating").get_text().strip()]
            except (AttributeError, KeyError):
                userrating = np.nan  # unrated films have no rating element

            rateid.append({
                'film_id': int(poster.find(class_="really-lazy-load").get("data-film-id")),
                'user_rating': userrating
            })

    return rateid

Try it out with a user:

pd.DataFrame(getRatings('indiewire'))
     film_id  user_rating
0     228594          2.5
1     905069          3.0
2     666269          3.5
3     385511          3.0
4     777185          4.5
..       ...          ...
205   399633          1.5
206   468597          1.5
207   448164          3.0
208   381286          3.0
209    11370          NaN

210 rows × 2 columns
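The standalone script mentioned at the top essentially stitches these two pieces together. Here is a minimal sketch of that final step, assuming the functions and column names above, and dropping the non-movie entries we tagged with film_id 0:

# Hypothetical wrap-up, assuming the functions defined above are in scope.
username = 'indiewire'
details = pd.DataFrame(getLoggedFilmDetails(username))
ratings = pd.DataFrame(getRatings(username))

# Join the film details with the user's ratings on film_id,
# and drop the entries tagged with film_id 0 (TV shows etc.).
films = details.merge(ratings, on='film_id', how='left')
films = films[films['film_id'] != 0]

films.to_csv('{}_films.csv'.format(username), index=False)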

This will be useful for the recommendation engine part of my project, which is part of the reason I decided to write a scraper in the first place. Other than that, I thought it would be a great way to hone my Python skills and revive this blog :)

I hope you found this write-up/code useful!