I am bringing here comments from a related ticket that I'll close (DM-696).
The empty chunk list is currently installed in build/dist/etc, and it is "global", i.e., there can be only one such list at any given time. That prevents us from talking to more than one database, and switching to a database with a different empty chunk list can result in very confusing messages. We should at minimum fix that.
Further thoughts from Daniel are pasted below.
Ideally, we would rebuild it upon any changes to the dirTable. There was code in indexing.py (I think?) that was a placeholder attempt at generating it.
There are a couple of ways of generating it:
- From the range of numbers defined by the min and max chunk number, filter out the chunks determined to be non-empty by the existence of dirtable_NNN tables. This is what my scripts did, and it was a hassle to get it to work.
- Do a special all-chunks query (count or similar), but don't squash on errors; instead, add the failing chunks to the empty chunks list.
- Populate the empty chunks file when you create a database, or when a czar becomes aware that a database exists. Every time you load data, you know which chunks you are creating, so remove those chunks from the empty chunks file/list. The czar always checks the db entry in CSS to see if its empty chunks file is out of date. It is impossible to delete partitioned rows, so chunks never become empty after being non-empty.
- Compute a non-empty chunk list from the sec-index list: select distinct chunkId from blah (will go obsolete soon), build a hash-table or std::map from it, and use that. The czar can compute and cache this the first time the db is accessed. A sketch of this approach follows the list.
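As a rough illustration of that last approach, here is a minimal Python sketch that builds the empty chunk list by diffing the possible chunk-number range against the distinct chunkIds found in the secondary index. The table name, column name, and connection parameters are placeholders, not Qserv's actual schema or API.

import mysql.connector

def build_empty_chunk_list(db_config, index_table, min_chunk, max_chunk):
    """Return the sorted list of chunk numbers with no entries in the secondary index."""
    conn = mysql.connector.connect(**db_config)
    try:
        cur = conn.cursor()
        # One pass over the secondary index yields the set of non-empty chunks.
        cur.execute("SELECT DISTINCT chunkId FROM {}".format(index_table))
        populated = {row[0] for row in cur.fetchall()}
    finally:
        conn.close()
    # Any chunk number in the possible range that is not populated is empty.
    return sorted(set(range(min_chunk, max_chunk + 1)) - populated)

# Hypothetical usage:
# empty = build_empty_chunk_list(
#     {"host": "localhost", "user": "qserv", "password": "...", "database": "qservMeta"},
#     "LSST__Object", 0, 100000)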
And a comment made by Daniel on Apr 27, 2015:
My current thinking is that we should build the better secondary index, auto-manage it, and then build the emptyChunk list from that and cache it locally.
Provided a secondary index is available, we can create a list of populated chunks by an O(n log C) scan of each secondary index (where n = number of director table rows and C = number of non-empty chunks), and then an O(E log C) iteration over the E possible chunk numbers, checking each against the populated set (tree-indexed, C elements). n is in the billions, so it's too expensive to run at czar startup.
Actually, provided the workers are up, you should be able to generate the set of available chunkIds merely by asking xrootd about each chunkId. This happens in roughly constant time (about 30 seconds to 5 minutes?) because the biggest cost is just waiting for the xrootd redirector to time out for a chunkId.
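As a hedged sketch of the probing idea, the snippet below asks the redirector to locate each candidate chunk resource via the XRootD Python bindings. The redirector URL, the "/chk/<db>/<chunkId>" resource naming, and the timeout value are assumptions for illustration, not the actual Qserv conventions.

from XRootD import client
from XRootD.client.flags import OpenFlags

def probe_available_chunks(redirector_url, db, candidate_chunk_ids, timeout=5):
    """Return the set of chunkIds the redirector can locate."""
    fs = client.FileSystem(redirector_url)
    available = set()
    for chunk_id in candidate_chunk_ids:
        # locate() asks the redirector which servers host the resource; a failed
        # status (after the timeout) is treated here as "chunk not present".
        status, _ = fs.locate("/chk/{}/{}".format(db, chunk_id),
                              OpenFlags.REFRESH, timeout=timeout)
        if status.ok:
            available.add(chunk_id)
    return available

# available = probe_available_chunks("root://qserv-master:1094", "LSST", range(0, 100000))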
The empty chunks store should have fewer records than the number of possible chunks (max 100k or 1M), so a per-db (or per-partitioning-group) store could use any persistent key-value store (e.g. leveldb or tokyo-cabinet).
In the shorter term, rocksdb ( https://github.com/facebook/rocksdb ) makes a lot of sense to store the secondary index. It was built to address bigger-than-memory issues in leveldb, which was designed as a lightweight persistent key-value store. It does compression too.
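For concreteness, here is a very rough sketch of a rocksdb-backed secondary index mapping director-table keys to chunkIds, using the python-rocksdb bindings. The fixed-width key/value encoding and the on-disk path are placeholders, not a proposed format.

import struct
import rocksdb

class SecondaryIndex:
    def __init__(self, path):
        opts = rocksdb.Options(create_if_missing=True,
                               compression=rocksdb.CompressionType.snappy_compression)
        self._db = rocksdb.DB(path, opts)

    def put(self, object_id, chunk_id):
        # Fixed-width big-endian encoding keeps keys ordered by objectId.
        self._db.put(struct.pack(">q", object_id), struct.pack(">i", chunk_id))

    def chunk_for(self, object_id):
        value = self._db.get(struct.pack(">q", object_id))
        return struct.unpack(">i", value)[0] if value is not None else None

# idx = SecondaryIndex("/qserv/data/LSST_Object.secidx")
# idx.put(433327840428745, 6630)
# idx.chunk_for(433327840428745)  # -> 6630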
We can build the secondary index off a live dataset by issuing select keyColumn, chunkId from db.dirTable through the standard czar execution mechanism, with specialized result handling to insert into the index rather than a result table. This may be an easier solution than having the loader handle it.
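Sketching the "specialized result handling" idea, the snippet below streams the key/chunkId scan straight into an index store instead of materializing a result table. It uses a plain MySQL connection as a stand-in for the czar execution mechanism, and the table/column names are illustrative.

import mysql.connector

def load_secondary_index(db_config, dir_table, index, batch_size=10000):
    conn = mysql.connector.connect(**db_config)
    try:
        # An unbuffered cursor streams rows instead of holding billions in memory.
        cur = conn.cursor(buffered=False)
        cur.execute("SELECT objectId, chunkId FROM {}".format(dir_table))
        while True:
            rows = cur.fetchmany(batch_size)
            if not rows:
                break
            for object_id, chunk_id in rows:
                index.put(object_id, chunk_id)  # e.g. the rocksdb sketch above
    finally:
        conn.close()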