Load balancing and fault early detection for Apache Kafka clusters

Burato, Dario
2019/2020

Abstract

Apache Kafka is a publish-subscribe messaging system: producers publish data to a cluster, and clients subscribe to receive it. Messages are sent by producers and stored in partitions, and load balancing is achieved by distributing those partitions across the cluster's nodes. The component that assigns a message to a partition is called the partitioner, and it resides inside every producer. When partitions have no intrinsic meaning and are used purely for load balancing, the default partitioners shipped with Apache Kafka aim only to spread the same number of messages across all partitions. The most common Apache Kafka cluster configuration consists of multiple identical systems; when a cluster is upgraded with new, better-performing components, the old ones are usually removed. Even though rebalancing tools exist, adapting properly to a hybrid cluster configuration would take time, because partitioners focus on the amount of data rather than on node performance. The problem could be solved by changing the number of partitions on each old and new system to match their performance ratio, thus tricking the default partitioner logic, but this could actually hurt client performance. A partitioner that knows the performance of each cluster node is a proper solution. This document presents a formal method to detect problematic scenarios and a custom partitioner that adapts to them.
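The idea of a performance-aware partitioner can be illustrated with a minimal sketch. This is not the partitioner developed in the thesis and it does not use any Kafka client API. It is a plain smooth weighted round-robin, chosen here only as one simple way to send each partition a share of messages proportional to a weight. The class name `WeightedPartitioner`, the `weights` mapping, and the example values are all hypothetical; in practice the weights would have to come from some measurement of node performance.

```python
from collections import Counter


class WeightedPartitioner:
    """Illustrative only: picks partitions in proportion to given weights.

    `weights` maps a partition id to the (assumed) relative performance
    of the node hosting it. Uses smooth weighted round-robin, so the
    selection is deterministic and evenly interleaved.
    """

    def __init__(self, weights):
        if not weights or any(w <= 0 for w in weights.values()):
            raise ValueError("weights must be a non-empty mapping of positive numbers")
        self.weights = dict(weights)
        self.total = sum(self.weights.values())
        # Running credit per partition; starts at zero.
        self.current = {p: 0 for p in self.weights}

    def partition(self):
        # Every partition earns credit equal to its weight...
        for p, w in self.weights.items():
            self.current[p] += w
        # ...the partition with the most credit receives the message...
        chosen = max(self.current, key=self.current.get)
        # ...and pays back the total weight, so the others can catch up.
        self.current[chosen] -= self.total
        return chosen


# Hypothetical hybrid cluster: partitions 0 and 1 sit on old nodes,
# partition 2 sits on a new node assumed to be twice as fast.
partitioner = WeightedPartitioner({0: 1, 1: 1, 2: 2})
counts = Counter(partitioner.partition() for _ in range(400))
```

With these example weights, `counts` shows partition 2 receiving twice as many messages as each of the other two, whereas a partitioner that only equalizes message counts would give every partition the same share regardless of node speed.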
2019-07-10
Files in this item:
File: 843238-1221607.pdf
Access: open access
Type: Other attached material
Size: 1.36 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14247/22818