Compass Blues

In my current project at work we’re using the Compass framework to meet our searching needs. All in all it’s a great framework with lots of support for specifying your search indexes and neat integration with Hibernate/Spring (which we’re also using). Besides wrapping Lucene to simplify searching, it also comes with a JdbcDirectory implementation of the Lucene Directory to store indexes in the database. This is really cool for making sure that the index is always in sync with the data in the database, and you only need one backup for everything.
However….. the last couple of days we were running into serious problems with the JdbcDirectory implementation. At random times during indexing etc. the directory would throw SQLExceptions with the following error message (using PostgreSQL):

ERROR: Invalid Large-Object descriptor: 1183

and the whole transaction would come to a grinding halt. With the system going to test, these errors were becoming more and more serious, and I’ve spent days (literally) together with my colleague tracking down the problem.

When databases start throwing these kinds of exceptions, it generally means that you are trying to access the blob within an invalid transaction context. Therefore I checked all our transaction settings within Spring, and it turned out that we had indeed made a mistake there which caused the Compass process to not correctly participate in our Spring transaction management.

Breath of relief…..

But, after a brief elated feeling, the errors started occurring again. NOOoooooo, this was turning out to be a major pain. I started to believe that the problem might somehow be related to Postgres’ blob logic, which might be buggy. It’s really amazing how little info you find on Google if you search for the error message, and the information that you do find turns out to be irrelevant. Then I spotted in the PostgreSQL logs that the autovacuum process was logging at the same time as the error message above. AHA, that must be it (you’re getting this, right?). I turned off autovacuum, hoping that it was somehow responsible for the problems we faced. Started the test again and it seemed to be running longer….. CRASH, NOOOooooo. Well, that wasn’t it.

Today, I finally cracked the problem, and it was with Compass. Compass uses an algorithm to cache blobs during a transaction (don’t ask me why I did not spot this sooner), and it’s broken if you’re using Spring transactions. You see, when integrating with a Spring-managed transaction, the documentation specifies that in order for the JdbcDirectory to run in the same transaction as the Spring-managed ones that Hibernate etc. use, you have to use Spring’s TransactionAwareDataSourceProxy. This data source hands out connections as a ConnectionProxy, which is a new proxy around the current connection that holds the transaction. Two calls to getConnection don’t return the same proxy; each call returns a new instance.
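To make that proxy behaviour concrete, here’s a minimal sketch (using a plain java.lang.reflect.Proxy rather than Spring’s actual classes, and an identity-based equals, which is how such handlers typically behave): two wrappers around the same underlying connection don’t compare as equal.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.sql.Connection;

// Sketch, NOT Spring's actual code: each call hands out a NEW proxy
// around the same underlying connection, and equals() is proxy identity.
public class ProxyEqualityDemo {
    static Connection wrap(final Connection target) {
        InvocationHandler h = (proxy, method, args) -> {
            if ("equals".equals(method.getName())) {
                return proxy == args[0];                  // identity comparison
            }
            if ("hashCode".equals(method.getName())) {
                return System.identityHashCode(proxy);
            }
            return method.invoke(target, args);           // never reached below
        };
        return (Connection) Proxy.newProxyInstance(
                ProxyEqualityDemo.class.getClassLoader(),
                new Class<?>[] { Connection.class }, h);
    }

    public static void main(String[] args) {
        Connection real = null; // stand-in; we only call equals() here
        Connection c1 = wrap(real);
        Connection c2 = wrap(real);
        System.out.println(c1.equals(c2)); // false: two proxies, same target
    }
}
```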

The caching mechanism for blobs assumes that the connection object is a good key for the blob cache, but when the connections don’t compare as equal this doesn’t work, meaning that you frequently don’t get a cache hit on your blob. Besides being a performance hit, it’s also a problem because chances are high that at the end of the indexing operation the blob cache is not cleared and stale entries remain behind. From there on it’s just a matter of time before a connection a few transactions later does match the cache key and old blobs get returned from the cache, blobs whose handles are invalid by now, since the originating transaction was already committed.
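A toy illustration of that failure mode (my own sketch, not Compass’s actual cache code): a map keyed by the Connection object misses whenever the same logical connection arrives wrapped in a fresh proxy, and the entry cached under the old proxy is never evicted.

```java
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.util.HashMap;
import java.util.Map;

// Sketch of the bug, not Compass's code: caching blobs keyed by the
// Connection fails when every getConnection() returns a new proxy.
public class BlobCacheDemo {
    static final Map<Connection, String> blobCache = new HashMap<>();

    static Connection newProxyFor(final Connection target) {
        return (Connection) Proxy.newProxyInstance(
                BlobCacheDemo.class.getClassLoader(),
                new Class<?>[] { Connection.class },
                (proxy, method, args) -> {
                    if ("equals".equals(method.getName())) return proxy == args[0];
                    if ("hashCode".equals(method.getName())) return System.identityHashCode(proxy);
                    return method.invoke(target, args);   // never reached below
                });
    }

    public static void main(String[] args) {
        Connection underlying = null;                 // stand-in for the pooled connection
        Connection first = newProxyFor(underlying);
        blobCache.put(first, "cached blob handle");   // cached during the transaction

        Connection second = newProxyFor(underlying);  // same logical connection, new proxy
        System.out.println(blobCache.get(second));    // null -> cache miss
        System.out.println(blobCache.size());         // 1 -> stale entry left behind
    }
}
```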

I’ve patched it, not very prettily, in our own Compass version (1.1 by the way) within the DataSourceUtils, but I’m hoping that this will get fixed pretty soon within the Compass framework as a whole. I can imagine other JdbcDirectory users also running into this problem. Right now I’m just really happy I found the problem, because I was getting to the point where we had to decide to stop using it and go with a different (file-based) solution. Yah!!!!
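For what it’s worth, the general shape of such a fix looks roughly like this — not my actual patch, just a self-contained sketch. The ConnectionProxy interface is redeclared here to keep the example compilable on its own; Spring’s real org.springframework.jdbc.datasource.ConnectionProxy exposes the same getTargetConnection() method. The idea is to key the cache on the unwrapped target connection instead of the proxy:

```java
import java.lang.reflect.Proxy;
import java.sql.Connection;

public class ConnectionKeys {
    // Stand-in for Spring's ConnectionProxy, redeclared so this compiles alone.
    interface ConnectionProxy extends Connection {
        Connection getTargetConnection();
    }

    // Unwrap proxies until we reach the real connection: a stable cache key
    // for every proxy handed out within the same transaction.
    static Connection cacheKeyFor(Connection conn) {
        while (conn instanceof ConnectionProxy) {
            conn = ((ConnectionProxy) conn).getTargetConnection();
        }
        return conn;
    }

    // Test scaffolding below: throwaway proxies so main() can run without a DB.
    static Connection proxyOver(final Connection target) {
        return (Connection) Proxy.newProxyInstance(
                ConnectionKeys.class.getClassLoader(),
                new Class<?>[] { ConnectionProxy.class },
                (p, m, a) -> {
                    if ("getTargetConnection".equals(m.getName())) return target;
                    if ("equals".equals(m.getName())) return p == a[0];
                    if ("hashCode".equals(m.getName())) return System.identityHashCode(p);
                    return m.invoke(target, a);
                });
    }

    static Connection plainConnection() {
        return (Connection) Proxy.newProxyInstance(
                ConnectionKeys.class.getClassLoader(),
                new Class<?>[] { Connection.class },
                (p, m, a) -> {
                    if ("equals".equals(m.getName())) return p == a[0];
                    if ("hashCode".equals(m.getName())) return System.identityHashCode(p);
                    return null;
                });
    }

    public static void main(String[] args) {
        Connection real = plainConnection();
        Connection wrap1 = proxyOver(real);
        Connection wrap2 = proxyOver(real);
        System.out.println(wrap1.equals(wrap2));                      // false
        System.out.println(cacheKeyFor(wrap1) == cacheKeyFor(wrap2)); // true
    }
}
```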