Quantifying Sufjan Stevens with the Genius API and NLTK

I am an avid Sufjan Stevens fan. I really enjoy his peaceful, somewhat melancholy style of music. As a person who is not religious myself, one of the things that I find interesting about his songs is that religious - specifically Christian - themes seem to appear quite broadly across his oeuvre. A song will often have a superficial, literal interpretation, and potentially another allegorical interpretation with religious connotations. Stevens has discussed his faith, and how it influences his music, in several interviews. Nonetheless, discussions rage online as to whether his songs should be viewed through a lens of religion or simply taken at face value.

I was thinking about this yesterday while listening to his song about the elusive, possibly extinct ivory-billed woodpecker, 'Lord God Bird'.

It made me wonder how often Sufjan Stevens actually references God in his work. Is he talking more, or less, about religion as time goes on? Since I have been itching to do some lighthearted Python coding, I decided I'd try to find out using www.genius.com as a lyrics resource. Fortunately, Genius has a decent REST API which can be accessed using an OAuth token. For experiments like mine with no users, it's possible to generate a client token after having registered an application.

First, I needed a complete list of Sufjan songs to work with. Using requests, it was fairly easy to throw together a minimal API client that accessed the artist songs endpoint at api.genius.com/artists/<id>/songs. This endpoint only returns 20 results by default, so it was necessary to page through until I'd built a list of all songs by the artist.

import requests

CLIENT_ACCESS_TOKEN = "some_token"

BASE_URI = "https://api.genius.com"

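def _get(path, params=None, headers=None):
    """Send an authenticated GET request to the Genius API and return the parsed JSON."""
    url = '/'.join([BASE_URI, path])
    token = "Bearer {}".format(CLIENT_ACCESS_TOKEN)

    # Copy any caller-supplied headers so the caller's dict is not modified.
    headers = dict(headers or {})
    headers['Authorization'] = token

    response = requests.get(url=url, params=params, headers=headers)
    # Raise an exception for 4xx/5xx responses instead of returning error JSON.
    response.raise_for_status()

    return response.json()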

def get_artist_songs(artist_id):
    """Return every song listed for an artist by walking the paginated endpoint."""
    current_page = 1
    songs = []

    while True:
        path = "artists/{}/songs/".format(artist_id)
        params = {'page': current_page}
        data = _get(path=path, params=params)

        page_songs = data['response']['songs']
        if not page_songs:
            # An empty page means we have run past the last page of results.
            break

        songs += page_songs
        current_page += 1

    return songs

This is a fairly naive implementation - it'll keep paging through until it gets to the first empty page of results - but it worked fine for my purposes. I chose to separate out the request logic because I wanted to implement some other API calls using the same approach.
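The "page until empty" idea can also be separated from the HTTP call, which makes it easy to try out without touching the network. The following is only a sketch: collect_all_pages, fetch_page, max_pages and the fake page data are names invented for illustration, not part of the Genius API or of the client above.

```python
def collect_all_pages(fetch_page, max_pages=100):
    """Call fetch_page(1), fetch_page(2), ... and gather the results.

    Stops at the first empty page, or after max_pages requests as a
    safety net in case the source never returns an empty page.
    """
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break
        items.extend(batch)
    return items


# Stand-in for the real API: three pages of fake results, then nothing.
fake_pages = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}
all_items = collect_all_pages(lambda page: fake_pages.get(page, []))
print(all_items)
```

With the client above, fetch_page could be a small function that calls _get for the given page number and returns data['response']['songs'].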

Running the get_artist_songs function with artist_id=958 got me a list of songs by Sufjan Stevens. I filtered out any that didn't list him as the primary artist:

songs = get_artist_songs(958)
songs = [song for song in songs if song['primary_artist']['id'] == 958]

This left me with 284 Sufjan songs. I didn't do any filtering beyond this, so it's possible that the data set contains duplicates - I don't know how strict Genius is about removing multiple instances of the same song.
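If duplicates did turn out to be a problem, one rough way to thin them out would be to compare normalised titles. This is a sketch rather than something I ran against the real data: normalise_title, drop_duplicate_songs and the sample songs are made up for illustration, and the approach would wrongly merge two genuinely different songs that happen to share a title.

```python
import re


def normalise_title(title):
    """Lowercase a title and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def drop_duplicate_songs(songs):
    """Keep the first song seen for each normalised title, preserving order."""
    seen = set()
    unique = []
    for song in songs:
        key = normalise_title(song["title"])
        if key not in seen:
            seen.add(key)
            unique.append(song)
    return unique


# Invented sample entries; only the 'title' key matters here.
sample = [
    {"id": 1, "title": "Seven Swans"},
    {"id": 2, "title": "seven  swans!"},
    {"id": 3, "title": "Chicago"},
]
unique_songs = drop_duplicate_songs(sample)
print([song["id"] for song in unique_songs])
```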

Now that I had my song list, I needed to find the lyrics for each song. Unfortunately, this wasn't a field that was present in the song objects I'd got back from the API. However, I had the URL of the song itself on Genius. I used BeautifulSoup, another of my favourite Python libraries, to grab the song lyrics directly from the HTML of the page... or that's what I would have done if it were permitted by the Genius Terms of Service. To be honest, since Genius are shamelessly annotating other people's web content I'm sure they'll be cool with this. Here's what that code might potentially look like:

import requests
from bs4 import BeautifulSoup

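def scrape_lyrics(url):
    """Fetch a Genius song page and return its lyrics as a single string."""
    response = requests.get(url)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, 'html.parser')

    # The lyrics live inside a <lyrics> element on the song page.
    lyrics = soup.find(name="lyrics")
    if lyrics is None:
        # No lyrics element found; return an empty string rather than crash.
        return ""

    # Remove any embedded <script> block so its contents are not read as lyrics.
    if lyrics.script:
        lyrics.script.decompose()

    return " / ".join(lyrics.stripped_strings)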

BeautifulSoup makes things so simple. It's easy to find the lyrics tag in the HTML, and then get all child strings within the tag using lyrics.stripped_strings. stripped_strings just removes extraneous whitespace, but otherwise behaves like the more commonly-used strings. I joined the strings together with / for aesthetic purposes, but it's really not necessary.

Using this function, I was able to retrieve the lyrics for all 284 songs on my list.

for song in songs:
    url = song['url']
    lyrics = scrape_lyrics(url)
    song['lyrics'] = lyrics

Armed with this trove of data, I could start doing some rudimentary analysis. My initial idea was to use the release dates of songs to investigate whether Sufjan had become more, or less, godly over time. However, only 34 of the 284 songs had release date information available through the Genius API, so that idea was quickly nipped in the bud. If I'm feeling motivated at some point, I may revisit this and combine the data with another information source to get a better picture of God references per year.
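For what it's worth, the per-year breakdown itself would be simple once the dates exist. Here is a sketch under some loud assumptions: each song dict carries a 'release_date' string beginning with a four-digit year and a precomputed 'god_count' integer. Both key names, the references_per_year helper and the sample data are hypothetical.

```python
from collections import defaultdict


def references_per_year(songs):
    """Sum the per-song reference counts by release year, skipping undated songs."""
    totals = defaultdict(int)
    for song in songs:
        release_date = song.get("release_date")
        if not release_date:
            continue  # no date available, so the song cannot be placed in a year
        year = int(release_date[:4])
        totals[year] += song["god_count"]
    return dict(totals)


# Invented sample data, not real release dates or counts.
sample_songs = [
    {"title": "Song A", "release_date": "2004-03-16", "god_count": 3},
    {"title": "Song B", "release_date": "2004-03-16", "god_count": 1},
    {"title": "Song C", "release_date": None, "god_count": 5},
    {"title": "Song D", "release_date": "2015-03-31", "god_count": 2},
]
per_year = references_per_year(sample_songs)
print(sorted(per_year.items()))
```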

What I can tell you, however, is which of Sufjan's songs feature the most references to God. I chose the words 'God', 'Lord', 'Jesus' and 'Christ' as explicit references - if I missed any, please let me know.

First, let's look at the lyrics in aggregate:

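import nltk

# The stop word list and the tokenizer both rely on NLTK data packages,
# which need to be fetched once with nltk.download() before this will run.
stop_words = set(nltk.corpus.stopwords.words('english'))
stop_words.update(['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}', '/'])

all_lyrics = " ".join([song['lyrics'] for song in songs])

# Lowercase every token, then drop stop words and punctuation.
all_tokens = [token.lower() for token in nltk.word_tokenize(all_lyrics)]
all_tokens = [token for token in all_tokens if token not in stop_words]
fdist = nltk.FreqDist(all_tokens)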

Here I am bringing NLTK to bear, admittedly at a very rudimentary level. I tokenized the lyrics, splitting them up into individual words, and then built a frequency distribution from the tokens. I removed stop words and punctuation to make the results less noisy.
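To see the same idea without any NLTK data installed, here is a stripped-down stand-in using only the standard library. It is not what I ran: a regular expression replaces nltk.word_tokenize, collections.Counter replaces nltk.FreqDist, and the tiny stop word set and sample sentence are invented for illustration.

```python
import re
from collections import Counter

# A deliberately tiny stop word set; NLTK's English list is much longer.
STOP_WORDS = {"the", "a", "and", "of", "in", "to", "is", "i", "my"}


def count_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into alphabetic tokens and count non-stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


# Made-up sample text, not a real lyric.
counts = count_words("Oh Lord, the Lord is my shepherd and I love the Lord")
print(counts.most_common(2))
```

Like FreqDist, a Counter returns 0 for a word it has never seen, so lookups such as counts['christmas'] are safe.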

The fdist object has a couple of fun options. Looking directly at the words we chose above, we can see how often Sufjan talks about God in all of his songs:

god_words = ['christ', 'jesus', 'lord', 'god']
god_count = sum([fdist[word] for word in god_words])

For the whole body of lyrics I was analyzing, direct references to God appeared a whopping 187 times in 69 distinct songs. In the absence of a comparison with a non-Christian artist, I don't want to draw any direct conclusions from this, but I think it's fair to say that Sufjan revisits this theme a lot.
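The snippet above only gives the total; the number of distinct songs needs a per-song pass. A minimal sketch of how both figures could be computed together, again with a plain regular expression standing in for the NLTK tokenizer. god_reference_stats and the sample lyrics are hypothetical, and the sample numbers are not the real results.

```python
import re

GOD_WORDS = {"christ", "jesus", "lord", "god"}


def god_reference_stats(lyrics_by_title):
    """Return (total references, number of songs with at least one reference)."""
    total = 0
    songs_with_reference = 0
    for lyrics in lyrics_by_title.values():
        tokens = re.findall(r"[a-z]+", lyrics.lower())
        hits = sum(1 for token in tokens if token in GOD_WORDS)
        total += hits
        if hits:
            songs_with_reference += 1
    return total, songs_with_reference


# Invented sample lyrics keyed by invented titles.
sample_lyrics = {
    "Song A": "Lord, oh Lord",
    "Song B": "nothing relevant here",
    "Song C": "God and Jesus Christ",
}
total_refs, song_count = god_reference_stats(sample_lyrics)
print(total_refs, song_count)
```

Because only whole tokens are matched, a word like 'godly' is not counted, whereas "God's" is split into 'god' and 's' and therefore is.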

We can also see what else Stevens talks about a lot in his songs by plotting the 50 most common words in the frequency distribution.

fdist.plot(50)

Figure: Word frequency in Sufjan Stevens songs

Unsurprisingly, 'love' is the most common noun across all songs. Perhaps more surprisingly, 'Christmas' is extremely common too - that is, until you know that Sufjan Stevens has released two Christmas albums.

We can go into a bit more detail, and look at the songs that feature God the most frequently:

god_list = []
for song in songs:
    tokens = [token.lower() for token in nltk.word_tokenize(song['lyrics'])]
    tokens = [token for token in tokens if token not in stop_words]
    # Use per-song names so the aggregate fdist and god_count are not overwritten.
    song_fdist = nltk.FreqDist(tokens)
    song_god_count = sum(song_fdist[word] for word in god_words)
    god_list.append((song['title'], song_god_count))

# filter out songs with no reference to god
god_list = [song for song in god_list if song[1]]

# sort by number of references
god_list = sorted(god_list, key=lambda tup: tup[1], reverse=True)
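Since only the first ten entries are needed, a full sort is not strictly required. As an alternative sketch, heapq.nlargest from the standard library picks out the top entries directly. top_songs and the sample pairs are names and data invented for illustration.

```python
import heapq


def top_songs(counts, n=10):
    """Return the n (title, count) pairs with the highest counts, largest first."""
    return heapq.nlargest(n, counts, key=lambda pair: pair[1])


# Invented (title, count) pairs in the same shape as god_list.
sample_counts = [("Song A", 2), ("Song B", 7), ("Song C", 4), ("Song D", 1)]
top_two = top_songs(sample_counts, n=2)
print(top_two)
```

Applied to god_list, top_songs(god_list) would give the same ten entries as slicing the sorted list with god_list[:10].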

Here, then, are the top 10 Sufjan Stevens songs by Godliness:

Come On! Let's Boogey to the Elf Dance
Away In A Manger
Holy Holy, Etc.
Get Real Get Right
Oh God, Where Are You Now? (In Pickerel Lake? Pigeon? Marquette? Mackinaw?)
Seven Swans
Hark! The Herald Angels Sing!
Ding-a-ling-a-ring-a-ling
Once in Royal David's City
The Transfiguration

Aside from Christmas songs, the album that appears the most is Seven Swans, with two tracks ('Seven Swans' and 'The Transfiguration') featuring in the top 10. This makes sense, as the album is acknowledged as one of the more overtly religious of his works.

I hope you've enjoyed this little adventure into data collection and analysis using Python, with a distinct Sufjan Stevens theme. Thanks go to the singer himself, as well as Genius for providing the data for this post.


July 22, 2016
