Cloud-compatible frameworks for CPU-intensive computing

Hello,

I am looking for a framework or library to make good use of the cloud for distributing heavy CPU-bound algorithms (written in C++). The library should provide the right abstractions to simplify the code and increase its robustness.

Each "Job":

  • Takes 3s to 10h to execute on a single computer
  • Is submitted on the fly, and must run in parallel with the other jobs
  • Uses 0 to 5 read-only files (which take non-trivial time to load) that can be quite large and that I cannot split. There can be many such files overall, but each job only uses a few
  • Can be partitioned into smaller tasks that are run by a worker pool (see the sketch after this list)
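
For clarity, here is a minimal sketch of what I mean by a job and its tasks (the Job/Task names are purely illustrative, not tied to any particular framework):

    #include <functional>
    #include <string>
    #include <vector>

    // Illustrative only: the rough shape of my jobs and tasks.
    struct Task {
        std::function<void()> work;  // one partition of the job's computation
    };

    struct Job {
        std::string id;
        std::vector<std::string> input_files;  // 0 to 5 large read-only files
        std::vector<Task> tasks;               // partitions run by a worker pool
    };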

The big files prevent me from using a simple load balancer with dumb workers to run tasks, since most of the time would be spent loading files. Therefore, I am implementing a cache at the worker level and dedicating a subset of the worker pool to each job for the duration of its execution.
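
To illustrate the worker-level cache I have in mind, here is a minimal sketch (WorkerFileCache and load_file are hypothetical names; thread safety and eviction are simplified):

    #include <fstream>
    #include <iterator>
    #include <memory>
    #include <mutex>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Sketch: each worker keeps the large read-only files it has already
    // loaded, so tasks of the same job scheduled on that worker do not pay
    // the loading cost again.
    class WorkerFileCache {
    public:
        std::shared_ptr<const std::vector<char>> get(const std::string& path) {
            std::lock_guard<std::mutex> lock(mutex_);
            auto it = cache_.find(path);
            if (it != cache_.end()) {
                return it->second;  // cache hit: reuse already-loaded data
            }
            auto data = std::make_shared<const std::vector<char>>(load_file(path));
            cache_.emplace(path, data);
            return data;
        }

    private:
        // Expensive load of a whole file into memory (simplified).
        static std::vector<char> load_file(const std::string& path) {
            std::ifstream in(path, std::ios::binary);
            return std::vector<char>(std::istreambuf_iterator<char>(in),
                                     std::istreambuf_iterator<char>());
        }

        std::mutex mutex_;
        std::unordered_map<std::string,
                           std::shared_ptr<const std::vector<char>>> cache_;
    };

This is the kind of job-to-worker affinity I would prefer a framework to manage for me, rather than maintaining it by hand.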

Could you recommend a framework/library/tool that handles autoscaling, intelligent task distribution, job & task management, error reporting, synchronization, retries, etc.?

So far I have seen:

  • Kubernetes with RabbitMQ routing (or an equivalent message broker), with hand-rolled job life-cycle management, task distribution, and scaling
  • Apache Storm, which is interesting but seems to support only Java workers (I would rather have language-agnostic workers)
  • Hurricane (a C++ port of Storm), which doesn't seem feature-complete or reliable
  • Spark and Hadoop, which seem to apply algorithms to distributed big data, rather than splitting an algorithm and applying each part to a job's specific non-splittable data