I am looking for a framework or library that makes good use of the cloud to distribute heavy CPU-bound algorithms (in C++). The library should provide the right abstractions to simplify the code and increase its robustness. Each job:
- Takes 3s to 10h to execute on a single computer
- Is submitted on the fly, and must run in parallel with the other jobs
- Uses 0 to 5 read-only files (which take non-trivial time to load) that can be quite large and that I cannot split. There can be many files overall, but each job only uses a few
- Can be partitioned into smaller tasks that are run by a worker pool
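To make the last point concrete, here is a minimal sketch of the partitioning scheme on a single machine, using `std::async` as a stand-in for the distributed worker pool. The task (`run_task`, a slice sum) and the split size are hypothetical placeholders for the real CPU-bound algorithm:

```cpp
#include <algorithm>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for one CPU-bound task: process a slice of the input.
long long run_task(const std::vector<int>& data, size_t begin, size_t end) {
    return std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
}

// Partition a job into roughly n_tasks slices, run each slice on a worker
// (here: a std::async thread), then reduce the partial results.
long long run_job(const std::vector<int>& data, size_t n_tasks) {
    std::vector<std::future<long long>> futures;
    const size_t chunk = (data.size() + n_tasks - 1) / n_tasks;
    for (size_t b = 0; b < data.size(); b += chunk) {
        const size_t e = std::min(b + chunk, data.size());
        futures.push_back(std::async(std::launch::async, run_task,
                                     std::cref(data), b, e));
    }
    long long total = 0;
    for (auto& f : futures) total += f.get();  // join and reduce
    return total;
}
```

In the real system each task would instead be shipped to a remote worker, but the shape (split, scatter, reduce) is the same.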
The big files prevent me from using a simple load balancer with dumb workers to run tasks, since most of the time would be spent loading files. I am therefore implementing a cache at the worker level, and I dedicate a subset of the worker pool to each job for the duration of its execution.
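The worker-level cache I have in mind looks roughly like this: each worker loads a read-only file at most once and hands out shared, immutable views of it to the tasks it runs. The class name, the injected `loader` callback, and the hit/miss counters are all hypothetical, just to illustrate the idea:

```cpp
#include <functional>
#include <map>
#include <memory>
#include <mutex>
#include <string>

// Sketch of a per-worker cache for large read-only files. The loader
// callback stands in for the expensive load step (disk/network read, parse).
class FileCache {
public:
    explicit FileCache(std::function<std::string(const std::string&)> loader)
        : loader_(std::move(loader)) {}

    // Returns the cached contents for `path`, loading them on first access.
    std::shared_ptr<const std::string> get(const std::string& path) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = cache_.find(path);
        if (it == cache_.end()) {
            it = cache_.emplace(path, std::make_shared<const std::string>(
                                          loader_(path))).first;
            ++misses_;
        } else {
            ++hits_;
        }
        return it->second;
    }

    int hits() const { return hits_; }
    int misses() const { return misses_; }

private:
    std::function<std::string(const std::string&)> loader_;
    std::map<std::string, std::shared_ptr<const std::string>> cache_;
    std::mutex mu_;
    int hits_ = 0, misses_ = 0;
};
```

Dedicating a subset of workers to a job then amounts to routing all of that job's tasks to workers whose caches already hold the job's files, which is exactly the kind of affinity-aware scheduling I would like the framework to handle.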
Can you recommend a framework/library/tool that handles autoscaling, intelligent task distribution, job and task management, error reporting, synchronization, retries, etc.?
So far I have looked at:
- Kubernetes with RabbitMQ routing (or an equivalent message broker), with hand-rolled job life-cycle management, task distribution, and scaling
- Apache Storm, which is interesting but seems to support only Java workers (I would rather keep workers language-agnostic)
- Hurricane (Storm ported to C++), which doesn't seem feature-complete or reliable
- Spark and Hadoop, which seem designed to apply algorithms to distributed big data, rather than to split an algorithm and apply each part to a job's own non-splittable data