Interview: The Creator of “The Largest Vocabulary in Hip-Hop” Breaks Down His “Consistently Inconsistent” Chart

A conversation with Matt Daniels, the creator of the Hip-Hop Vocabulary infographic.

May 09, 2014

"I am not a data scientist at all. This would get an 'F' if you were at a data science school."

Early this week, an infographic ranking 85 rappers by the size of their vocabulary went viral. The graph was interesting mainly for its surprising results: many of rap's most celebrated lyricists ranked fairly low, while talented yet more obscure MCs like Kool Keith and Aesop Rock topped the chart. Though heavily discussed amongst techie crowds, many rap fans were left to ponder what the graph really meant. So we spoke to its creator, digital strategist Matt Daniels, about Reddit, Rap Genius and why rap music continues to make for such attractive web content.

What’s going on, how is everything? Things are good, I am traveling right now so it’s a bit hectic. I’m actually on a sabbatical right now. I’m in Austin, so I’m more or less very far away from home base in New York City. I’m doing this for another six weeks, so it’s pretty fun.

Like a sabbatical from teaching? It sounds cooler than it is. I work for a consulting firm in New York City called Undercurrent, and it’s like a leave of absence basically, a little bit of a break from working.

What kind of work does Undercurrent do? So Undercurrent is a digital strategy firm. Like an agency, just no developers or designers. The focus of the company is organizational design: making large traditional legacy companies look like the start-up technology companies that you see today. It’s a pretty cool job, but sometimes you can’t flex creative ambition, so that is what the sabbatical is all about.

What findings or key ideas do you think come out of a project like this? Honestly, a lot of people have asked that, and I don’t think that there is one. I think of it as more of a data point, like height. This is the vocabulary of these rappers, take from it what you will. I think throughout the piece I try to avoid making any judgments of quality, like this rapper is better than that rapper. It’s just an interesting topic for discussion.

What are some of the discussions that you’ve seen pop up? Which rappers are excluded. And a lot of that has to do with fanboyism, and whether people are slighted by not seeing their favorite rapper on the graph. The number one rapper in this case, Aesop Rock, was hypothesized by the Reddit community to be number one. I excluded him at first because I figured it would detract from the graph, because I wanted it to be easy to digest for a mainstream hip hop fan, but when I ran the numbers he was so far to the right on the axis it was a slight not to include him. Another thing that I saw brought up was on io9, owned by Gawker. A writer there, Robert Gonzalez, wrote about this quote from Jay Z’s “Moment of Clarity,” where he talks about his vocabulary choice, why those stylistic choices matter in terms of how his fans receive his content, or receive the songs, so that was interesting.

“I dumb down for my audience to double my dollars.” Right. So Talib Kweli is considered to be much more complex in his word choices, and intellectual, so Jay was making a point there, in that quote. But this is io9 writer Gonzalez’s point, not mine. He drew this insight, I found it insightful and added it in, so it’s definitely not a conclusion that I’ve made.

It’s funny, Jay Z lands dead center on the graph. Take Outkast. They’re from the south and use a lot of southern slang. You’re not going to find as many uses of slang for some East Coast or West Coast artists. Where they’re from and their style of rapping is obviously going to have an effect on their vocabulary. That was validated by the data.

So southern rappers have more variety to their slang? No, it’s just Outkast specifically. Lots of made-up words, lots of southern drawl. It’s just their word choices are extremely unique.

To have names like Drake and Kanye and Tupac and Wayne and T.I. and 50 and Cam’ron rank so low—these are dudes that are known for being very inventive with language. I guess that’s why I asked what the takeaway was, because this isn’t how people who typically discuss these artists would rank them. No, no, nope, no takeaway. It’s getting academic about something that’s not academic. Like when sports data became big, that was cool, even though some of it was probably irrelevant. And you can be inventive and not have the number of unique words be higher or lower. Eminem and DMX’s styles are different and unique. DMX is one of the first artists to have three albums go platinum. Just because he has a low vocabulary doesn’t affect how inventive he is. You can be inventive without using an SAT word in your rhymes. It comes down to style. Just one extra data point to discuss.

Why did you choose Shakespeare and Herman Melville as benchmarks? I am not making any sweeping judgments about hip-hop versus Shakespeare. In fact I don’t bring up Shakespeare at all in the whole entire article, besides as an intro or a prompt. It’s really just an interesting way to hook the reader from a content perspective. People say Shakespeare has arguably the largest vocabulary ever. Let’s run a data analysis on Shakespeare versus other rappers using more or less the same methodology and see how the data pans out. It will hook the reader if you’re a literary fanatic, or you love Moby Dick for example, and at least adds a little bit of context to what you are looking at.

There was a very specific methodology that you used, between controlling for the number of words and the number of albums, and how some rap groups are ranked together and how some are ranked separately. Is this scientific? How holistic can a study like this be? Zero holistic. There are so many biases that skew the data that I would have to spend a year cleaning this data to make it perfect. There are issues with transcription of how words are pronounced in Rap Genius. You have issues with repetition in choruses, you have issues with the artists—maybe these artists didn’t even write their own words. Like, “pimpin” and “pimping” are two different words, but maybe, maybe not depending on how the listener transcribed the word. It is consistently inconsistent.
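
The “pimpin” versus “pimping” problem can be made concrete with a small sketch. This is not Daniels’s code: `count_unique` and `restore_dropped_g` are hypothetical helpers, and the suffix rule is a deliberately crude assumption that would also mangle ordinary words such as “cabin.” It only shows how a transcriber’s spelling choice changes a unique-word count.

```python
import re

def count_unique(tokens, normalize=None):
    """Count distinct tokens, optionally after applying a normalizer to each one."""
    if normalize is not None:
        tokens = [normalize(t) for t in tokens]
    return len(set(tokens))

def restore_dropped_g(token):
    # Hypothetical rule: rewrite a trailing "in" or "in'" as "ing".
    # It over-corrects real words ending in "in", so it is illustrative only.
    return re.sub(r"in'?$", "ing", token)

transcribed = ["pimpin", "pimping", "pimpin'"]
raw_count = count_unique(transcribed)                        # each spelling counts separately
merged_count = count_unique(transcribed, restore_dropped_g)  # spellings collapse to one word
```

Without normalization the three spellings count as three words; with the rule applied they count as one, which is the kind of inconsistency he describes.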

So it’s more of like a rough estimate as opposed to any scientific kind of ranking. I am not a data scientist at all. This would get like an F if you were at a data science school. This is a personal project I made during sabbatical time to test my chops in data visualization. I think it’s worth existing on the Internet, if only to encourage discussion.

What’s some of the rap that you listen to? I love Outkast. I appreciate them 10 times more after writing an article on them and really spending some time looking at their evolution over time. I have been a huge fan of them ever since. Childish Gambino recently, Kanye, Jay Z, a lot of mainstream rappers. Actually, this whole project has exposed me to a lot of acts that I otherwise wouldn’t have spent as much effort listening to. I don’t even know if I am pronouncing their name correctly: El-P. I’ve listened to Deltron before, but Del the Funky Homosapien. Sage Francis is another omission that has come up a lot that people want to have included. Tech N9ne was excluded in the beginning, just because I felt he was too obscure for the audience that I was trying to write for. And there was so much uproar that I had to include him and I’ve been listening to his music ever since.

Biggie and Kendrick Lamar are also omitted here. So, your total vocabulary changes if you have more words to work with. If we took the number of unique words Jay Z used, the sample size would be huge compared to Kendrick, because Jay has over 12 albums. You have to use a sample size. Initially I picked 50,000 words, but several artists hadn’t reached 50,000 yet, like Drake. So I lowered the threshold to 35,000. I tried my best to include Biggie. I went through all his albums, pulled his verses out of Junior Mafia, tried to get him to 35,000, and I think I got to 32,000. I could have fudged the numbers, but I wanted to make sure it was still consistent from a methodology standpoint. I also only counted studio albums. A lot of rappers’ mixtapes are just stream of consciousness. I want to make sure that they feel that this is really representative of what their vocabulary entails.
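
The fixed-sample idea described here can be sketched as follows. This is a minimal illustration, not the code behind the chart: the function name, the regex tokenizer, and the choice to take the first 35,000 words of the lyrics are all assumptions, since the interview only gives the threshold.

```python
import re

def unique_words_in_sample(lyrics, sample_size=35_000):
    """Count distinct words among the first `sample_size` words of `lyrics`.

    Returns None when the artist has fewer than `sample_size` words, which
    mirrors excluding artists who fall short of the threshold.
    """
    # Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", lyrics.lower())
    if len(tokens) < sample_size:
        return None
    return len(set(tokens[:sample_size]))

# Tiny usage example with a sample size of 4 instead of 35,000.
enough = unique_words_in_sample("Yo yo check the mic", sample_size=4)
too_few = unique_words_in_sample("Yo yo", sample_size=4)
```

Capping every artist at the same number of words keeps a 12-album catalog from beating a 3-album one purely on volume, and an artist who falls short of the threshold (32,000 words in Biggie’s case) gets no score rather than an inflated or deflated one.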

You would say there’s a creative difference between albums and mixtapes? Hard to say, we could probably argue it both ways. I just wanted to be consistent. Lil Wayne had several mixtapes before any of his original material came out right? The other problem was that I got the data set from Rap Genius. Since mixtapes are underground they are less popular, and the transcription isn’t as accurate. So sticking with studio albums maintained the data, and the methodology, and for all those reasons you can’t include people like Kendrick Lamar just because they don’t have enough studio album material.

Was your choice to use rap lyrics as a data pool strategic at all? In order to get coverage on music blogs? I had the data from a previous project. I don’t have any other data, so I worked with what I got. People have said you could do this with country music, or rock. I don’t know how celebrated unique vocabulary is in those genres. That didn’t weigh into why I picked hip-hop versus another genre.

What do you think the rappers featured in the graph would think about it? Like I said, it’s a data point. It doesn’t mean anything, doesn’t mean you’re better than another rapper. I don’t think these rappers would notice or care. If they did, I’d be flattered. I’d probably ignore anything they thought and just be flattered they saw something I made.

Do you plan to continue data analysis with hip-hop as a lens? Yeah, if the right project presents itself. I still have all this data from Rap Genius. It’s getting older every day, but I wrote to a lot of people on Reddit that I’d keep expanding it.

Is there a large community of rap fans on Reddit? Yeah, have you used it before?

Not really, I don’t use it often. How do you write for The FADER and not use Reddit? Wait, seriously? You’re not big into Reddit?

Nah. Okay, seriously, as someone who does consulting in content, one of the things I learned doing this is Reddit is one of the most important pillars of how content spreads. Some of the subreddits aren’t that great, but there’s Hip-Hop Heads, which is the dominant community for hip-hop on Reddit. Then Hip-Hop Production, Alternative Hip-Hop, Underground Hip-Hop, and they’ve all been super receptive to something like this.

Thanks for taking the time to talk, man. Thank you, happy to chat. Sounds like you’re going to have some time to spend on Reddit threads.

Per your advice, I gotta get up on it. I guess that’s where I should be getting my hip-hop news from. Not hip-hop news. But you write for The FADER, this is how articles spread. In fact, that’s probably what I’m going to write next. If you have any connection to content on the internet, I totally recommend getting up on Reddit.
