The Phish Set Cover Problem, Pt 2

After a little time and a couple of attempted optimizations, I feel it is time to take a step back. This post will be more analytical, specifically addressing the question “Is this even possible to solve?”

How big is this problem?

At the time of the data collection, there were 291 unique Phish songs and 1,622 Phish shows. Throughout the 1,622 shows, a total of 25,169 songs were played, for an average of 15 songs per show.

So, now we have 15 songs played per show. Let’s use this to say that if we’re “visiting” a show, it will take at least 15 operations to “mark” each song as “visited”

Terminology

I apologize, but feel I must adjust my terminology. Phish fans will say “I’ve been to X shows and never saw a Fluffhead.” I plan to use this type of mentality rather than the iPod mentality I started with. So from now on, the main question will be “What is the list of Phish shows I should have attended in order to see at least one of every song?” I feel it lends itself better to “attend” shows and “see” songs. It has more of a je ne sais quoi than “installing” shows and “having” songs.” The term I used earlier of “visiting” comes more from algorithm studies, and feels too generic.

So now we’re as cheap as we can be. We want to buy as few tickets as possible in order to see every song Phish ever played live. I’m sure you can relate!

Back to it

So it takes 15 “operations” per show. This is essentially taking your book of songs and marking “yep, I’ve seen that song” once for every song. This is “seeing” a show.

Since I’m currently doing a recursive algorithm, if I see that attending the show won’t help me solve the problem, I’ll have to un-attend the show. This’ll be another 15 operations to say “nevermind, I haven’t seen that song.”

So we can essentially say that we’re at 30 operations, for every show we want to try to attend. Great!

How many shows do we need to attend

So the greedy algorithm solved the problem in 71 shows. I know it can be solved with a little modification in 70 shows. I know the problem cannot be solved in 60 shows. My guess, based on no other information whatsoever, is that it’ll probably take about 67 shows.

An assumption that can be made is that we’ll only need to consider the 67 fewest played songs. Call this “Assumption X.” This is a very strong assumption, and every time this assumption is violated in practice the number of required operations will multiply. We’ll still make Assumption X for a baseline.

So, looking at the first 67 songs, we can see how many times each of them were played. By my count, there were 36 songs that were only played once. There were 17 songs that were played twice. Ten songs were played three times, and 11 songs played 4 times.

So to get to our estimated 67 shows, we need 36×1, 17×2, 10×3, and 4×4 songs vs the number of times played.

We don’t really care about the 36 songs that were only played once. We know for a fact that each of those shows are in the list of tickets we need to buy. So let’s look at the songs that were played twice.

Seventeen songs were played only twice. Let’s start by looking at two songs: Song A and Song B. Song A was played at Show 1 and Show 2. Song B was played at Show 3 and Show 4.

If we want to see both Song A and Song B, we’ll have to attend either:

  1. Show 1 and Show 3
  2. Show 1 and Show 4
  3. Show 2 and Show 3
  4. Show 2 and Show 4

To extend this to a third song Song C (played at Show 5 and Show 6), we’ll need to double the number of combinations to check.

  1. Show 1, Show 3 and Show 5
  2. Show 1, Show 4 and Show 5
  3. Show 2, Show 3 and Show 5
  4. Show 2, Show 4 and Show 5
  5. Show 1, Show 3 and Show 6
  6. Show 1, Show 4 and Show 6
  7. Show 2, Show 3 and Show 6
  8. Show 2, Show 4 and Show 6

The astute reader would notice the list size doubles every time another twice-played song needs to be checked, and that is precisely the case (in the best-case scenario.) So after the 17 different songs played twice, we have 2^17, or 131072 different combinations. This is a cakewalk for any computer to test.

However, we’re only at 53 shows. We guessed we have to get to at least 67. Each of the thrice-played songs triples the number of combinations played. So to get to 63 songs, we’ll have to test 2^17 * 3^10 combinations, or 7,739,670,528 different show combinations!

Seven billion still isn’t a huge number for a computer. Assuming an algorithm that can run around 1GHz (one billion operations per second) we’re only at 7 seconds worth of operation time attending shows. Of course, we still have to perform 30 operations to mark each song as seen.

The last four songs were played at least 4 times. So we’re at an absolute best combination number of 2^17 * 3^10 * 4^4, or 1,981,355,655,168, or about 2 trillion combinations.

Best case scenario, at 30 operations per show attended, 2 trillion combinations, and 1 billion operations per second, we’re at about 60,000 seconds of computation time. 17 hours.

Wait A Second…

“In your example, you said Song A was played at Shows 1 and 2, and Song B was played at Shows 3 and 4. What happens if there is some overlap?”

This type of scenario doesn’t affect the assumption that we need to see 67 shows to see every song. So in the case that Song A and Song B were played at the same show, we’ll now need a four-time-played song to take the place. So our 17 hours just became 34 hours. (2^16 * 3^10 * 4^5; * 30) Nice try though. I’ll get to how quickly this blows up in practice shortly.

Pruning the tree

Image result for edward scissorhands clipping gif

After all of that data collection, there are still a lot of songs that were only played once. For instance, the song Fluff’s Travels was only played one time. (Whether or not it is due to shoddy record keeping of a partial show recording in the late ’80s is not something I’m going to discuss.) So since we know that we’ll have to include that show, why should we even care about, say, Bathtub Gin?

This is exactly what I’ve done here: (or at least tried to do…)

    private static int selectAllSinglePlayedSongs() {
        int numShowsSelected = 0;
        
        for (Song song : songList) {
            if (song.listSize() == 1) {
                Show currentShow = song.getShowList().get(0);
                if (!currentShow.isSelected()) {
                    numSongsLeftToSelect -= currentShow.select();
                    numShowsSelected++;
                }
            }
        }
        
        return numShowsSelected;
    }
    
    private static void pruneShowSetlists() {
        for (Show show : showList) {
            show.pruneSongList();
        }
    }
    
    private static void prunePlayedSongs() {
        Iterator<Song> songIter = songList.iterator();
        
        while(songIter.hasNext()) {
            Song currentSong = songIter.next();
            
            if (currentSong.isSelected()) {
                songIter.remove();
            }
        }
    }

// Select all of the single-played shows
// Select all songs covered by those shows
int numShowsSelected = selectAllSinglePlayedSongs();
        
// Remove all those songs from the show setlists
pruneShowSetlists();
        
prunePlayedSongs();

After running this routine, I’ve gone from a list of 291 songs played a total of 25,169 times down to a list of 102 songs played 2,177 times. This removed 27 shows. So now to hit the “magic” number of 67, I only have to find 40 new shows. This should be easy now…?

Analysis of the new tree

Things haven’t actually changed too much.  In fact, I think they’re “worse.” Here’s why:

On the initial analysis, I’d assumed that all 37 songs that were played only one time would count toward the 67 shows needed. In fact, only 27 shows were needed. So we’ll need 40 shows from the new distribution. Before pruning this was:

34 songs have been played 1 times
16 songs have been played 2 times
11 songs have been played 3 times
10 songs have been played 4 times
11 songs have been played 5 times

After pruning it looks like this:

15 songs have been played 2 times
8 songs have been played 3 times
7 songs have been played 4 times
11 songs have been played 5 times

And while our first solution was 2^17 * 3^11 * 4^5 (plus the 34 single-played shows for a total of 67) it is now 2^15 * 3^8 * 4^7 * 5^10 different possibilities… in the best-case scenario. This is about 3.5 * 10^19 different combinations! Written out as 35,000,000,000,000,000,000. With one computer doing one billion tests per second, it would take 3.5*10^10 seconds. That is about ten million hours of computing time. One thousand years.

One last thing to note: The measured speed of the program is 3*10^7 tests per second, not 1 * 10^9. So there are a couple more orders of magnitude needed to get squared away.

Current Metrics

Currently I’m testing about 15 million combinations per second. Testing combinations of 60 shows took two minutes. 61 shows took 13 minutes. 62 shows took 106 minutes, and 64 shows took about 1000 minutes.

This rate of growth is to be expected. Each extra show is taking about a factor of ten longer than the previous step. At this rate it seems like testing 65 shows will take a week. 66 shows would take two months, and 67 shows would take two years.

Two years? Why isn’t it one thousand years?

I’m still running my early exit routine, which tries to stop going down paths that seem to be failing. Also, some combinations are probably not possible, though I’ll have to look into that more. Lastly, I was assuming that the factor of ten would continue to be the grow rate. This would likely not be the case, as the early exit routine would start allowing more paths. Also, we’d see more of the cases where the “ideal” songs would have already been selected, which we saw was a bad thing. So maybe two years is a little nice of an estimation.

Next Steps

The next thing that I’d like to do is get some visualization tools in the problem. Something similar to Graph Stream seems to make sense, and is probably easier to visualize than something like graphviz .

The other thing I’d like to do is get the recursive algorithm running in parallel. This involves the fairly easy problem of being able to create copies of the tree. The tricky part would be to create an inter-process communication to say “You try that path, I’ll try this path.” I feel like there will probably be enough in that to merit its own post.

To be continued…