The Phish Set Cover Problem

A little while back I was involved in a casual conversation. As these things often do, it led to a reference to a specific Phish song. The next logical step was to try to listen to said song. Unfortunately, despite a fairly large catalog of Phish shows, that particular song was not in the library. We had no internet connectivity, so we were left with only what was in the catalog for the rest of the night.

This led to an inevitable question: What is the list of shows necessary for you to have at least one copy of every song?

Many people would simply walk away from that night never knowing the answer. I couldn’t help but give it a shot.

The Problem

To reiterate the problem: Find the shortest list of Phish shows that would contain at least one of every song.

Fortunately, this is a problem that can be broken down into several more manageable tasks. Or at least I had thought so, but I’ll get to that later. The most obvious steps were to:

  1. Get a list of every Phish song
  2. For each song, find the list of shows where that song was played
  3. For each show, find the setlist
  4. Find the shortest list of shows that contains every song

Of course, I underestimated the magnitude of the factorial function. That essentially rendered #4 impossible to solve, but I’m getting ahead of myself.
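To preview why: the number of candidate show subsets is a binomial coefficient, and those get enormous fast. A quick sketch (the 1,500 shows and the 70-show target are round numbers I picked for illustration, not counts from the data):

```java
import java.math.BigInteger;

public class Combinations {
    // Exact n-choose-k via BigInteger; each intermediate product is an
    // integer, so the running divide never truncates.
    static BigInteger choose(int n, int k) {
        BigInteger result = BigInteger.ONE;
        for (int i = 0; i < k; i++) {
            result = result.multiply(BigInteger.valueOf(n - i))
                           .divide(BigInteger.valueOf(i + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        // How many 70-show subsets exist among ~1,500 shows? The answer has
        // well over 100 digits - no brute force is touching that.
        System.out.println(choose(1500, 70).toString().length() + " digits");
    }
}
```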

Getting the list of every song

I’m not accustomed to writing web scrapers, so I was hoping there’d be some API to the database. They do have an API, but it seems to be aimed at web developers working in Javascript. After poking around there a little bit, I decided I’d have to take the more traditional scraper route. Fortunately, with programming, a lot of the time you can simply start from someone else’s code.


I found jsoup, a Java HTML parser, which seemed to fit all of my needs. First, it has an online interface where you can test queries without having to implement everything up front. Second, it’s free to use!

You can simply tell jsoup to pull the source from any website. Since we’re looking for a list of every song Phish has played, the most obvious (and convenient) starting point was the setlist site’s list of songs. This gives the screenshot below, where the input and parsed output give clues as to what you’d be looking for.

Songs webpage parsed by jsoup

CSS Queries

If my aversion to web development hasn’t been obvious, it will start to become clear here. jsoup supports CSS queries. I had never used CSS, but thanks to jsoup’s user-friendliness I was able to test queries quickly. With a little intuition and some luck, I figured out that every song I wanted to look at was given a row of the following form:

<tr class="all originals">

Assuming I’m only interested in originals, I can use the CSS query “.originals” to filter out everything that I don’t need.

Parsed by CSS class

Now I feel I’m getting close! I have been able to isolate the songs tagged as “originals” by the folks who run the site. However, all of the additional links muddy up the data – I don’t need “Alumni Blues Phish 106 1985-03-16 2016-07-20 29”; I simply want “Alumni Blues”. A little more tinkering with CSS queries and I came across the “:first-of-type” selector. If I pull the first link (the HTML a tag), I get what I want.

Filtered to the first hyperlink
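The whole filtering chain can be reproduced off-line with jsoup’s parse method. The inline HTML below is a toy stand-in I made up to mimic the row structure described above (the site’s real markup is richer), but it exercises the same “.originals” class and “:first-of-type” trick:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import;

public class SelectorDemo {
    // A toy stand-in for the songs table: one "originals" row with extra
    // links, one "covers" row that the query should skip.
    static final String HTML = "<table>"
        + "<tr class=\"all originals\"><td>"
        + "<a href=\"/song/alumni-blues\">Alumni Blues</a> "
        + "<a href=\"/show/1985-03-16\">1985-03-16</a></td></tr>"
        + "<tr class=\"all covers\"><td>"
        + "<a href=\"/song/funky-bitch\">Funky Bitch</a></td></tr>"
        + "</table>";

    // Only rows carrying the "originals" class survive this query.
    static int countOriginals() {
        Document document = Jsoup.parse(HTML);
        return".originals").size();
    }

    // Keep only the first link per matching row - the song title.
    static String firstSongName() {
        Document document = Jsoup.parse(HTML);
        return".originals td a:first-of-type").first().text();
    }

    public static void main(String[] args) {
        System.out.println(countOriginals() + " original(s): " + firstSongName());
    }
}
```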

Halfway there!

Java Implementation

Now that I have everything working from the web interface, it’s time to look into the Java implementation. Essentially two magical lines of code pull an entire page and filter out the necessary “elements”.

Document document = Jsoup.connect("").get();
Elements elements =".originals :first-of-type a");

With those two lines, there’s an object elements that can be used to loop over every song on the site! With three more lines of code (and two more classes that I won’t discuss) the entire list of songs can be created.

for (Element element : elements) {
	Song song = new Song(element.text(), "" + element.attr("href"));
	songList.add(song);
}

After all that, we have a list of song names and, more importantly, URLs to information about the song.

Getting the rest of the data

I alluded above to the fact that the important part of the list of songs was actually each song’s URL. Using the same jsoup approach, only searching for the “.etpitem” class instead of “.originals”, you’ll end up with a list of all shows; the “.setlist-song” class then lets you pull every song from every show.

After all of this, with a little bit of Java trickery, all the data can be wrapped up in a couple of lists. This can be visualized by a simple mapping image.

Song to show mapping

Each song would be in the left box. Each arrow coming off a song indicates that the song has been played at specific shows. The list of shows would be in the right box, and the arrows leading into them can be thought of as the show’s setlist.
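I won’t reproduce the real classes, but the mapping above suggests shapes roughly like these – every name besides getShowList (which appears in the search code later) is my own invention:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shape of the Show side of the mapping: a show owns its setlist.
class Show {
    final String date;
    final List<Song> setlist = new ArrayList<>();
    Show(String date) { = date; }
}

// Hypothetical shape of the Song side: a song knows every show it appeared in.
class Song {
    final String name;
    final String url;
    final List<Show> showList = new ArrayList<>();
    Song(String name, String url) { = name; this.url = url; }

    // Wire up both directions of the song <-> show mapping at once.
    void addPerformance(Show show) {
        showList.add(show);
        show.setlist.add(this);
    }

    List<Show> getShowList() { return showList; }
}
```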

At this point everything related to scraping and organization is complete. The only thing left to do is simply solve the “minimum show list” that covers all of the songs.

The Algorithm


In a lot of these projects I find myself leaning toward recursion whenever I can. Since I’ve been in the embedded world most of my career, I feel like recursion is often bad practice (if not explicitly forbidden.) This offered an interesting problem where a recursive brute-force method made sense…

Or so I thought…

The logic of the algorithm can be seen in the flowchart below. Not the best flowchart I’ve made, but deal with it. Hopefully the link stays active after my 1-week gliffy trial expires.

Recursive Algorithm Flow

There were a couple of intuitive tricks that I had used in this algorithm to help speed things up.

Early Exit

Near the top of the flowchart, I have what I called “Early Exit.” It is a simple check of whether the number of songs left can still be covered by the number of shows left. For instance, if I have 100 songs left and only 2 shows left, I know that I won’t be able to cover all of those songs, since Phish only played about 15-20 songs per show. In that case, I can exit the current processing stack. I called this an early exit, since I kick out of the routine very “early” in the flowchart, which should speed up the selection considerably.

Sorted Selection

Another intuition I had was that I could start with songs that had been played only once or twice, and select their shows first. That is the “Least Played” part of the “Find Least Played, Unselected Song” step. If a song was only played once, its show has to be in the minimum list of shows, so let’s select that show first. That way we don’t have to loop through the 597 performances (at the time of writing this) of You Enjoy Myself.
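A minimal sketch of that ordering, assuming play counts have already been tallied. Apart from You Enjoy Myself’s 597, the song names and counts below are made up:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class LeastPlayedFirst {
    // Order songs rarest-first so one-off songs pin their shows immediately.
    // playCounts maps song name -> number of shows it appears in.
    static List<String> leastPlayedOrder(Map<String, Integer> playCounts) {
        List<String> songs = new ArrayList<>(playCounts.keySet());
        songs.sort(Comparator.comparingInt(playCounts::get));
        return songs;
    }
}
```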


The code to do all of this was probably easier to make than the Gliffy flowchart:

private static boolean chooseShow(final int numShowsToSelect) {
	// Check if we can exit early
	if (numShowsToSelect * maxSongsPerShow < numSongsLeftToSelect) {
		return false;
	}
	Song nextUnplayedSong = SongUtils.getNextUnplayedSong(songList);
	if (nextUnplayedSong == null) {
		// The "we're done" return
		return true;
	}
	List<Show> currentShowList = nextUnplayedSong.getShowList();
	for (Show currentShow : currentShowList) {
		int numSongsAdded =;
		numSongsLeftToSelect -= numSongsAdded;
		if (	numSongsLeftToSelect == 0
				|| chooseShow(numShowsToSelect - 1)) {
			return true;
		}
		// This branch didn't pan out; undo the selection and try the next show
		int numSongsRemoved = currentShow.unselect();
		numSongsLeftToSelect += numSongsRemoved;
	}
	return false;
}

for (int numShowsToSelect = 1; numShowsToSelect < songList.size(); numShowsToSelect++) {
	long startTime = System.nanoTime();
	if (chooseShow(numShowsToSelect)) {
		// Success!
		System.out.println("Successfully got it with " + numShowsToSelect + " shows!");
		break;
	}
	long endTime = System.nanoTime();
	long duration = endTime - startTime;
	System.out.println("Tried with " + numShowsToSelect + " shows - took: " + (duration / 1000000000.0) + " seconds");
}



By the time I hit 57 shows, the algorithm was taking 20,000 seconds, and the runtime was increasing by a factor of 5 with every iteration. As it turns out, small numbers can become really, really big. There are about 10 million combinations of just the shows containing songs that were played only 5 times – and there are only 10 such songs. In a true brute-force search, each of those combinations has to be considered. Next.
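That “about 10 million” figure is easy to verify: ten songs with five candidate shows apiece means 5^10 ways to pick one show per song.

```java
public class ComboCount {
    // Number of ways to pick one candidate show for each rare song:
    // showsPerSong ^ numSongs, computed with a plain loop.
    static long combos(int showsPerSong, int numSongs) {
        long total = 1;
        for (int i = 0; i < numSongs; i++) {
            total *= showsPerSong;
        }
        return total;
    }

    public static void main(String[] args) {
        // 5^10 = 9,765,625 - roughly the 10 million quoted above.
        System.out.println(combos(5, 10));
    }
}
```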

Breadth First Search

For some reason, I had thought that Breadth First Search (BFS) would solve the issues I ran into with the recursive strategy. The reason: each time the recursive search had to go “one deeper,” it basically had to start over from scratch. This is the benefit of BFS: it remembers the sequence of operations that got you to each state, so all you have to do is add each next step to the processing queue. Wikipedia has a good enough explanation of BFS, so I’ll leave that as an exercise for the reader.

To explain this in further detail, I describe the final logic of the recursive algorithm as such:

  1. Go to the next “Show”
  2. Select each song from the show (add 1 to the play count)
  3. If every song is selected, we’re done
  4. If every song is not selected, unselect each song from the show

It can be seen that we have to loop through each song, for each show. Not good! However, only one copy of the SongList is kept in memory, with the counts updated in place.

In the case of BFS, we’re using a processing queue. This means that every time we want to consider a new show, we have to copy the entire song and selected show arrays. So essentially the logic is:

  1. Pop the next Show and UnplayedSongList off the Execution Queue
  2. Remove each song played at the show from the UnplayedSongList
  3. If the UnplayedSongList is empty, we’re done!
  4. If the UnplayedSongList is not empty:
    1. Find the next unplayed song
    2. Add a copy of the UnplayedSongList and the next Show to the end of the Execution Queue

This led to a LOT of memory usage, since I had to make a copy of the selectedShowList, nextShow, and unplayedSongList for every show permutation. Further, the lookup from Show to Song was VERY quick in the recursive search method; since the BFS works on copies of objects rather than the objects themselves, there has to be an actual search for each song. NOTE: This could be done in O(log n), but even that is worse than O(1).
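To make the copying concrete, here is a stripped-down sketch of the BFS over partial selections, with shows reduced to plain sets of song names and none of the Song/Show bookkeeping from the real code. Each queue entry carries its own copy of the still-uncovered songs – exactly the memory blow-up described above:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BfsCover {
    // One queue entry: which songs are still uncovered, and the first show
    // index not yet considered (so we don't revisit permutations).
    private static class State {
        final Set<String> uncovered;
        final int nextShow;
        State(Set<String> uncovered, int nextShow) {
            this.uncovered = uncovered;
            this.nextShow = nextShow;
        }
    }

    // Returns the minimum number of shows covering allSongs, or -1 if the
    // shows can't cover every song.
    static int minShows(List<Set<String>> shows, Set<String> allSongs) {
        Deque<State> queue = new ArrayDeque<>();
        queue.add(new State(new HashSet<>(allSongs), 0));
        int depth = 0;
        while (!queue.isEmpty()) {
            depth++; // every state at this level is one show deeper
            for (int level = queue.size(); level > 0; level--) {
                State state = queue.poll();
                for (int i = state.nextShow; i < shows.size(); i++) {
                    // The per-candidate copy is where all the memory goes.
                    Set<String> next = new HashSet<>(state.uncovered);
                    next.removeAll(shows.get(i));
                    if (next.isEmpty()) {
                        return depth;
                    }
                    queue.add(new State(next, i + 1));
                }
            }
        }
        return -1;
    }
}
```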

Long and the short of it

The recursive algorithm seemed to be much better than the BFS in this instance. It didn’t seem to take up much memory, though there were two loops over each setlist to mark / clear each song.

The BFS could be optimized so that I’m not copying the entire song list (URL, title, etc.) but I’m not confident that’ll work out either. There are simply too many possibilities.

Greedy algorithm

Here’s the sure-fire way to come up with “a solution.” It involves the same process as the depth-first search above, but instead of trying every combination, it selects the “best” show it can find at each step. The algorithm can be outlined as:

  1. Find the song that has been played the fewest times
  2. From that song, find the show that selects the most unplayed songs
  3. Select that show, and mark all those songs played

Doing this guarantees an answer (if the problem is solvable) but is not guaranteed to give the “best” answer.
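Before the code, it’s worth pinning down step 2. I read the getShowWeight() call in the routine below as counting how many still-needed songs a show would cover if selected; the implementation here is my guess at that idea, with shows reduced to sets of song names:

```java
import java.util.Set;

public class ShowWeight {
    // A show's weight: how many not-yet-covered songs its setlist contains.
    // "unplayedSongs" means songs no selected show covers yet, not songs
    // Phish never performed.
    static double showWeight(Set<String> setlist, Set<String> unplayedSongs) {
        int covered = 0;
        for (String song : setlist) {
            if (unplayedSongs.contains(song)) {
                covered++; // selecting this show would knock out one more song
            }
        }
        return covered;
    }
}
```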


private static void doGreedyRoutine() {
  Song nextUnplayedSong;
  while ((nextUnplayedSong = SongUtils.getNextUnplayedSong(songList)) != null) {
    List<Show> currentShowList = nextUnplayedSong.getShowList();
    if (currentShowList == null) {
      break; // no shows contain this song, so no list can cover it
    }
    double showWeight = 0;
    int optimalIndex = -1;
    for (int i = 0; i < currentShowList.size(); i++) {
      Show currentShow = currentShowList.get(i);
      double tempShowWeight = currentShow.getShowWeight();
      if (tempShowWeight > showWeight) {
        showWeight = tempShowWeight;
        optimalIndex = i;
      }
    }
    if (optimalIndex > -1) {
      Show selectedShow = currentShowList.get(optimalIndex);;
    }
  }
  int i = 0;
  for (Show showDisplay : showList) {
    if (showDisplay.isSelected()) {
      i++;
      System.out.println(i + ". " + showDisplay.toString());
    }
  }
}


We got something! The problem can be solved with at most 71 shows (as of the time I scraped the data).

This algorithm takes milliseconds to run, whereas the others took days and failed.


Next Steps

I’m saddened that I wasn’t able to get the “Absolute” solution down. I’m not yet convinced that it can’t be done, although I have done the math that suggests it probably can’t. That’ll be another post.

The next steps toward getting a better solution (I know for a fact that it can be done in fewer shows) are to REALLY optimize the search algorithm. This would mean running trials in parallel and trying to minimize the number of clock cycles per trial.

The final step would be to port the “optimized” code to an FPGA that has a lot of horsepower and see if a pipeline can be set up to really run through trials. Initial “Best Case Scenario” calculations suggest that it might take about 19 years of computing time, assuming one billion operations per second. So getting that down by a factor of hundreds is quite important. Could lead me to my first FPGA project…
