
Questions to address/reasons for research

  1. Is there a better way for us to scale the pipeline, apart from each chain having its own pipeline?
  2. The current pipeline relies on a single Kafka instance – what happens if that instance goes down?
  3. Kafka must have a way to support redundancy – how is it handled, and how do we set it up?
  4. Improve understanding of Kafka in general.

Summary

Kafka "TLDR"
Information is obtained from [this playlist from Confluent](https://www.youtube.com/watch?v=qu96DFXtbG4&list=PLa7VYi0yPIH0KbnJQcMv5N9iW8HkZHztH&index=2).

  1. Kafka internally stores events (messages) as key-value pairs.
     - Keys _can_ be serialised objects, but are often primitives.
     - Because of this, keys are not necessarily UIDs for the event (message IDs).
  2. Topics are not indexed; records are accessed by offset.
     - offset = currentKey + currentMsgInBytes (TBC; similar to accessing the next value in an array via pointers?)
  3. **Topics != queues -> consuming a message does not remove it from the topic unless retention is configured otherwise (by age/size).**
  4. Partitions -> similar to RAID 0, data is striped across partitions.
     - If no key is specified on the message, records are distributed round-robin.
     - Otherwise the key goes through a hash function to determine the partition number (similar to a hashmap).
  5. The producer decides which partition the message gets sent to (standard/custom partitioner config); see the sketch after this list.
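A minimal sketch of points 4–5 using the standard Java producer client. The broker address (`localhost:9092`) and topic name (`example-topic`) are placeholders: the default partitioner hashes a record's key to pick its partition, while records without a key are spread across partitions.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class PartitioningSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder broker address -- replace with the real bootstrap servers.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed record: the default partitioner hashes the key, so the same
            // key always lands on the same partition (point 4, "otherwise" case).
            RecordMetadata keyed = producer
                .send(new ProducerRecord<>("example-topic", "chain-42", "payload"))
                .get();
            System.out.printf("keyed record -> partition %d, offset %d%n",
                keyed.partition(), keyed.offset());

            // Un-keyed record: with no key, records are spread across partitions
            // (round-robin, or "sticky" batching in newer clients -- point 4, "if" case).
            RecordMetadata unkeyed = producer
                .send(new ProducerRecord<>("example-topic", null, "payload"))
                .get();
            System.out.printf("un-keyed record -> partition %d, offset %d%n",
                unkeyed.partition(), unkeyed.offset());
        }
    }
}
```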

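As a sketch of how point 3 (retention) and question 3 (redundancy) might be set up, the snippet below creates a topic with a replication factor of 3 and an explicit age-based retention limit via the Java Admin client. The broker addresses, topic name, and 7-day retention value are assumptions; a replication factor of 3 requires at least three brokers.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class TopicSetupSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder broker list -- a replicated topic needs a multi-broker cluster.
        props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");

        try (Admin admin = Admin.create(props)) {
            // 3 partitions for striping (point 4); replication factor 3 so each
            // partition has copies on other brokers (redundancy, question 3).
            NewTopic topic = new NewTopic("example-topic", 3, (short) 3);

            // Messages stay on the topic after consumption; they are only removed
            // by retention settings such as age (point 3). 7 days is an assumed value.
            topic.configs(Map.of(
                TopicConfig.RETENTION_MS_CONFIG, String.valueOf(7L * 24 * 60 * 60 * 1000)));

            admin.createTopics(Collections.singleton(topic)).all().get();
            System.out.println("created example-topic with replication factor 3");
        }
    }
}
```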
Other considerations


Resources

Kafka replication visualised (source: Confluent)