Tuesday, March 01, 2005

The suicide checkers endgame database

When I computed the suicide checkers 6-piece endgame database, I was in for an unpleasant surprise: my compression program, which worked fine for the regular checkers endgame database, didn't do a good job at all on the suicide checkers database. While the regular 6-piece database is only about 40MB, the suicide 6-piece database is a full 150MB, nearly 4 times larger. Why is this?

To understand why it doesn't compress well, you first need to know how the compressor works: it uses a scheme known as "run length encoding", or RLE for short. RLE replaces "runs" of the same value in the database with a pair of values: the length of the run and the value of the run. If this sounds like gibberish, an example might help. Let's say you have a way of enumerating all positions in your database from 1...N. You write down the value for every position and get something like

WWWWWWWWWWDDDDDDDDDDWWWWWWWWWWDDDDDDDDDD....

where W stands for a win, D for a draw, and L for a loss (there happen to be no losses in this example). The compressor now works by storing (10,W), (10,D), (10,W), (10,D) instead. In many databases the runs reach amazing lengths, especially in lopsided databases where most positions are simply a win for the stronger side. Of course there are also the positions where one side can capture something, and those usually have more or less random results. For this reason, positions with captures are simply removed from the database; your program has to look one ply deeper and look up the position after the capture. This results in nice compression ratios for regular checkers: the compressed database is approximately 10x smaller than the uncompressed one. Now you can probably guess why the suicide checkers database doesn't compress well: the results in those databases are much less uniform, and therefore this compression scheme is no longer particularly efficient.
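
To make the scheme concrete, here is a minimal RLE sketch in Python. The names and the in-memory representation are just for illustration; the real database format is binary and more compact, of course.

def rle_encode(values):
    # Compress a sequence of result values (e.g. 'W', 'D', 'L')
    # into a list of (run length, value) pairs.
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1] = (runs[-1][0] + 1, v)   # extend the current run
        else:
            runs.append((1, v))               # start a new run
    return runs

def rle_decode(runs):
    # Expand (run length, value) pairs back into the original sequence.
    return ''.join(value * length for length, value in runs)

# The example from above: 10 wins, 10 draws, 10 wins, 10 draws.
data = 'W' * 10 + 'D' * 10 + 'W' * 10 + 'D' * 10
print(rle_encode(data))   # [(10, 'W'), (10, 'D'), (10, 'W'), (10, 'D')]
assert rle_decode(rle_encode(data)) == data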
Again, an example might make things clearer: in regular checkers, all positions with 2 kings vs 1 king are wins for the 2 kings if you remove the positions with a capture, with only one rather pathological exception: red kings on 3 and 12 and a white king on 11, with red to move. Therefore, this database compresses exceptionally well. In suicide checkers, 2 kings vs 1 is normally a win for the side with 2 kings too; however, if you played through the game I posted yesterday, you can see that in the endgame red could get 2 kings vs 1, but would lose all the same. Therefore, the 2 vs 1 king database in suicide checkers contains wins as well as losses, and doesn't compress well.

On another note, this also means that the database is more important in suicide checkers than in regular checkers: it would be quite difficult to write a heuristic that can guess the likely outcome of the game with 2 kings vs 1 on the board, something which is very simple for regular checkers. And for more complicated endgames it gets even more difficult, of course.

7 comments:

Martin Fierz said...

Thanks for the correction on the 2K-1K regular checkers, George. Once again I show my profound checkers knowledge :-)

About the reordering of the database: the regular database is made more compressible by removing all positions with a capture on the board. Taking into account your proximity idea, one might remove all positions in the suicide database where there is too much proximity. However, I'd be afraid that this would remove a very large part of the database!

Ed Gilbert said...

Martin, I tried Huffman encoding on the 9pc db and it gave better compression than RLE. I am not presently using it because uncompressing is slower, but for your suicide db it might be worthwhile if you can squeeze enough size out of it. You could try WinZip on your compressed db; if you get a lot of additional compression, that would be an indication that Huffman encoding might work well.
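
A quick way to run that test without WinZip is to DEFLATE the file programmatically; DEFLATE is LZ77 plus Huffman coding, so it is a reasonable proxy. A small Python sketch, with a made-up filename:

import zlib

def remaining_compression(path):
    # DEFLATE the already RLE-compressed db file and report how much
    # smaller it gets; a big gain hints that an entropy coder such as
    # Huffman would help.
    data = open(path, 'rb').read()
    packed = zlib.compress(data, 9)
    return 1.0 - len(packed) / len(data)

print(remaining_compression('db6.cpr'))   # hypothetical filename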

-- Ed

Martin Fierz said...

Ed, how much better did the Huffman encoding work compression-wise for your 9-pc database? And how much more expensive was it to decompress?

Martin, curious

Ed Gilbert said...

I did this about 8 months ago. The improvement was significant, IIRC around 25% smaller than RLE. I wrote the compressor but never finished writing the uncompressor. With each of my various ideas it became obvious that the uncompressing speed would be more than 25% slower than with RLE. I probably could have got the speed back by indexing at smaller intervals than 1k, but then the index data would consume more RAM. This is a problem for a 10pc db, but no problem for an 8pc db or smaller. Your db is 4x larger than for normal checkers; I wonder if it's because there is actually 4x as much 'information' content, or just that the particular RLE scheme we use for normal checkers doesn't do a good job on this information. If you got significantly better compression ratios with WinZip on your suicide db than on the normal checkers db, then Huffman encoding might be worth trying.
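
To give an idea of what the encoder does, here is a toy sketch in Python of how the Huffman code lengths get assigned; this is not my actual db code, which also has to deal with runs, indexing and bit-packing:

import heapq
from collections import Counter

def huffman_code_lengths(values):
    # Assign a code length to each distinct value; frequent values get
    # short codes, which is where the gain over RLE's fixed-size
    # (length, value) pairs comes from.
    freq = Counter(values)
    # Heap entries: (weight, tiebreak, {value: code length so far}).
    heap = [(w, i, {v: 0}) for i, (v, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)   # merge the two lightest
        w2, _, c2 = heapq.heappop(heap)   # subtrees, Huffman-style
        merged = {v: l + 1 for v, l in {**c1, **c2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# A lopsided db slice: mostly wins, a few draws and losses.
print(huffman_code_lengths('W' * 80 + 'D' * 15 + 'L' * 5))
# {'W': 1, 'D': 2, 'L': 2} -- the common result costs only 1 bit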

Martin Fierz said...

Ed, I compressed both the regular checkers 6-pc database and the suicide checkers database with Winzip. The result:

Regular:
From 38'545 KB down to 21'513 KB (-44%)

Suicide:
From 146'425 KB to 89'068 KB (-39%)

I'm not quite sure whether this is relevant, though: shouldn't I try compressing the original, uncompressed database with WinZip?

Ed Gilbert said...

Martin, yes, I agree that if you have some uncompressed files to test with, that would be more relevant. Based on what you found so far, it doesn't look like Huffman encoding will give you any more benefit than what I got with the normal checkers db. If all you can hope for is ~25% improvement, then Huffman encoding is not worth it, because of the uncompressing speed problem and all the new coding that it requires.

Martin Fierz said...

Hi Ed, I tried compressing the 3K vs 3K database with WinZip; it managed to compress the 4425 KB down to 2669 KB. My RLE compressor compressed the same database to 1777 KB. I suppose the difference comes from the captures, which I remove in the RLE compressor; WinZip of course can't take advantage of that. So I can't do the experiment without writing a new program to output an intermediate uncompressed file with the captures floated, and I will probably be too lazy for that :-)