A workflow for loading a large catalog into Qserv may require various forms of parallelism, both at the data preparation stage and during the actual loading. As a consequence, the loading stage may be operating on fragments of the same chunk spread across multiple input files. The current implementation of the database loading procedure is not safe to use on input data structured in this way, nor would it work for loading additional fragments of a chunk at a later time. There are two failure modes that would prevent the current procedure from succeeding:
- chunk-to-worker node mapping: each instance of the loading script follows its own round-robin sequence when allocating worker nodes to chunks as the chunks are discovered in an input folder. Since there is no coordination between instances of the loading script, two instances may allocate different worker nodes for the same chunk. This is a problem whenever the input data contain chunks split into fragments.
- chunk creation: the current implementation of the loading script assumes that loading always starts from a fresh (empty) instance of Qserv. As a result, the script does not check whether a chunk structure already exists on the corresponding worker node, and it may fail when attempting to add more data to an existing chunk.
A solution to the chunk-to-worker node mapping problem would be to pre-populate the CSS database with the complete mapping before loading any chunks into Qserv, so that every loader instance consults the same authoritative assignment instead of its own round-robin counter. A possible approach is sketched below.
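The following is a minimal sketch of how such a mapping could be pre-computed deterministically and persisted before any loader instance starts. The `write_css_mapping` helper and the CSS key layout used here are hypothetical placeholders for the real CSS interface; the point is that a round-robin over the sorted list of all chunks yields the same placement regardless of which process computes it.

```python
def assign_chunks(chunk_ids, workers):
    """Assign every chunk to a worker deterministically.

    Round-robin over the sorted chunk list produces the same result
    no matter which process computes it, so all loader instances
    agree on the placement of every chunk.
    """
    mapping = {}
    for i, chunk_id in enumerate(sorted(chunk_ids)):
        mapping[chunk_id] = workers[i % len(workers)]
    return mapping


def write_css_mapping(css, database, mapping):
    """Persist the mapping in CSS.

    The key layout below is an assumption made for illustration;
    the actual CSS schema would be used instead.
    """
    for chunk_id, worker in mapping.items():
        css.set(f"/DBS/{database}/CHUNKS/{chunk_id}/worker", worker)
```

With the mapping written up front, each loader instance looks up the worker for a chunk in CSS rather than deriving it locally, which removes the race between instances.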
A solution to the second problem (chunk creation) is to harden the loading procedure so that it detects the existence of a chunk on a worker node and does not fail if the chunk already exists (i.e., the chunk was created earlier by another instance of the loader, running either in parallel or before this one). A sketch of such an idempotent check follows.
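A minimal sketch of idempotent chunk creation on a worker, assuming the worker database is MySQL/MariaDB and that chunk tables follow a `<table>_<chunkId>` naming convention; the details of the real loader may differ.

```python
import mysql.connector


def ensure_chunk_table(conn, database, table, chunk_id):
    """Create the chunk table only if it does not already exist.

    CREATE TABLE ... IF NOT EXISTS makes the operation a no-op when
    another loader instance (parallel or earlier) created the chunk
    first, so appending further fragments to the chunk is safe.
    """
    cursor = conn.cursor()
    cursor.execute(
        f"CREATE TABLE IF NOT EXISTS {database}.{table}_{chunk_id} "
        f"LIKE {database}.{table}"
    )
    conn.commit()
```

Using `IF NOT EXISTS` avoids a separate existence check followed by a create, which would itself be racy between two parallel loader instances; the database resolves the conflict atomically.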