- An event is a combination of a notification (something that can be used to trigger other activities) and state.
- Events are usually represented as key-value pairs. An event can be serialized or deserialized with the language of your choice, for example to JSON or Protobuf (a small example follows below).
- The key in a Kafka message is just an identifier.
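For illustration only (the field names below are made up), an event can be thought of as a key plus a serializable value:

import json

# a hypothetical booking event expressed as a key-value pair
key = "user-123"                                   # the key is just an identifier
value = {"user_id": "user-123", "booked": True}    # the value carries the state

# serialize the value to JSON bytes before sending it to Kafka
serialized_value = json.dumps(value).encode("utf-8")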
- A topic is the fundamental unit of event organisation. It is like a table containing different events.
- Events of the same kind can be grouped and transformed using the key.
- Topics can be referred to as logs, not queues. A log has two characteristics:
- It is append-only.
- You can only seek by offset; the log is not otherwise indexed.
- Events in a topic are immutable.
- Topics are fundamentally durable. They can be configured so that messages expire after they reach a certain age or size limit.
- Kafka gives us the ability to break topics into partitions.
- Messages are distributed to these partitions based on the keys provided (keeping in mind that Kafka messages are key-value pairs).
- If messages do not have a key, they are distributed to the partitions in a round-robin manner.
- Keys allow Kafka to place messages with the same key in the same partition, preserving ordering by key (see the sketch below).
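A conceptual sketch of key-based partitioning: Kafka's default partitioner uses its own hash function (such as murmur2); zlib.crc32 below just stands in for it.

import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    # the same key always hashes to the same partition, which preserves per-key ordering
    return zlib.crc32(key) % num_partitions

choose_partition(b"user-123", 3)  # always returns the same partition for this key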
A broker is a computer, instance, or container running the Kafka process.
- Each broker manages partitions and handles read and write requests to those partitions.
- Brokers also manage the replication of partitions. They are deliberately kept simple, which makes them easy to understand and to scale.
It is risky to store partition data on a single broker, because brokers are susceptible to failure. To work around this, partition data is copied to several other brokers to keep it safe. These copies are called follower replicas, while the main partition is called the leader replica.
- In general, writes and reads happen at the leader replica.
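As an illustration of partitions and replicas together, a topic can be created with a chosen partition count and replication factor using the confluent_kafka AdminClient; this is only a sketch, and the broker address and topic name are assumptions.

from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # assumed local broker

# 3 partitions spread the load; replication_factor=3 means one leader plus two follower replicas
futures = admin.create_topics([NewTopic("bookings", num_partitions=3, replication_factor=3)])
for topic, future in futures.items():
    future.result()  # raises an exception if topic creation failed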
- The producer is where most developers spend their time. Partitioning lives in the producer: it decides where to place the key-value messages being produced.
- The producer is responsible for:
- Partition assignment
- Batching events for improved throughput
- Compression
- Retries
- Delivery callbacks
- Transaction handling
producer = Producer(config)
- acks -> determines the level of acknowledgement required before returning from a produce request (0, 1, all). The default value is "all".
- batch.size -> the maximum number of bytes that will be batched before sending a produce request. The default value is 16384.
- linger.ms -> the number of milliseconds to wait for a batch to fill before sending a produce request. The default is 0.
- compression.type -> the algorithm used to compress data sent to the broker. Can be set to none, gzip, snappy, lz4, or zstd. The default is none.
- retries -> the number of times to retry a request that failed for potentially transient reasons.
- delivery.timeout.ms -> the time limit for the overall produce request. Can also be used to limit retries. The default value is 120,000 ms (2 minutes).
- transactional.id -> a unique id for the producer that enables transaction recovery across multiple sessions of a single producer instance. The default is None.
- enable.idempotence -> when set to True, the producer adds a unique sequence number to messages, ensuring that only one copy of each message is written to the stream. The default is True.
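Putting the options above together, a producer configuration could look like this minimal sketch (the bootstrap.servers value assumes a local broker, and the overrides are just examples):

from confluent_kafka import Producer

config = {
    "bootstrap.servers": "localhost:9092",  # assumed local broker address
    "acks": "all",
    "compression.type": "gzip",
    "linger.ms": 10,
    "batch.size": 16384,
    "enable.idempotence": True,
}
producer = Producer(config)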
- produce(): below is what we use to send messages to topics.
producer.produce(topic, [value], [key], [partition], [on_delivery], [timestamp], [headers])
# example
import json

for i in range(10):
    value = json.dumps({"user_id": "34342ksdf-eww", "booked": "True"})
    producer.produce("bookings", value, str(i), on_delivery=callback, headers={'foo': 'bar'})
producer.flush()
- In the code above we only passed the topic (which is required), the value, the key, the on_delivery callback, and the headers. It is not a must to pass the partition, since the key provided can be used to determine the partition where the value will be placed.
- This function will also perform compression if configured, and will retry if any network issues are experienced.
- The produce() method is asynchronous; to ensure that all outstanding produce requests are complete, we need to call producer.flush().
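The produce() example above passes on_delivery=callback without defining it; one possible sketch of such a delivery callback (the function name is our assumption) is:

def callback(err, event):
    # invoked once per message, when the broker acknowledges it or when delivery finally fails
    if err is not None:
        print(f"Delivery failed for key {event.key()}: {err}")
    else:
        print(f"Delivered to {event.topic()} [{event.partition()}] at offset {event.offset()}")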
- init_transactions() -> used to initialize transaction handling for this producer instance. Takes an optional timeout parameter.
- begin_transaction() -> used to begin a new transaction.
- commit_transaction() -> flushes any active produce requests and marks the transaction as committed. Accepts timeout as a parameter.
- abort_transaction() -> used to purge all produce requests in this transaction and mark the transaction as aborted. Takes timeout as a parameter.
- send_offsets_to_transaction() -> used to communicate with the consumer group when a transaction includes both producers and consumers. Takes three parameters: offsets, group_metadata, and timeout.
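A minimal sketch of how these calls fit together, assuming the producer was created with a transactional.id and that the topic and payload below are placeholders:

producer.init_transactions()       # register the transactional.id with the broker
producer.begin_transaction()
try:
    producer.produce("bookings", value='{"booked": "True"}', key="42")
    producer.commit_transaction()  # flush outstanding messages and commit the transaction
except Exception:
    producer.abort_transaction()   # purge the produced messages and mark the transaction aborted
    raise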
Consumers are components subscribed to a topic. They consume events posted to the topic by producers. In Kafka, reading a message does not destroy it; it remains in the topic for other components that might want to use it later, and many consumers can read from one topic. A single consumer instance receives messages from all partitions of the topic it is subscribed to. If a second instance of the consumer application starts, the partitions are distributed equally between the consumers. This is called rebalancing, and it provides scaling and fault handling by default; nothing extra is required to make this scaling happen. Rebalancing forms the backbone of several Kafka features. The consumer keeps track of events that have been successfully processed; this can be done manually or automatically, depending on the configuration.
Just like the producer, the consumer takes a config dictionary. The config also includes the bootstrap server of the broker to connect to.
consumer = Consumer(config)
- group.id -> uniquely identifies this application so that additional instances are included in the same consumer group.
- auto.offset.reset -> determines the offset to begin consuming from if no valid stored offset is available. The default value is latest.
- enable.auto.commit -> if true, offsets are periodically committed in the background. The default value is true. The recommended approach is to set this to false and commit the offsets manually.
- isolation.level -> used in transactional processing (read_uncommitted, read_committed). The default value is read_committed. When set to read_uncommitted, the consumer reads all events, including those from transactions that were aborted.
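Putting these options together, a consumer configuration might look like the sketch below (the bootstrap.servers and group.id values are assumptions):

from confluent_kafka import Consumer

config = {
    "bootstrap.servers": "localhost:9092",  # assumed local broker address
    "group.id": "bookings-service",         # hypothetical consumer group name
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,            # commit manually after processing
    "isolation.level": "read_committed",
}
consumer = Consumer(config)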
consumer.subscribe(['bookings'], on_assign=assignment_callback)
def assignment_callback(consumer, topic_partitions):
    for tp in topic_partitions:
        print(tp.topic)
        print(tp.partition)
        print(tp.offset)
while True:
    event = consumer.poll(timeout=1.0)
    if event is None:
        continue
    if event.error():
        pass  # handle error
    else:
        pass  # process event
The event returned by poll() is a Message object with the following methods:
- error() -> returns KafkaError or None
- key() -> returns a str, bytes or None
- value() -> returns str, bytes or None
- headers() -> returns a list of (key, value) tuples or None
- timestamp() -> returns a tuple of the timestamp type (TIMESTAMP_CREATE_TIME or TIMESTAMP_LOG_APPEND_TIME) and the timestamp value
- topic() -> returns str or None
- partition() -> returns int or None
- offset() -> returns int or None
while True:
    event = consumer.poll(timeout=1.0)
    if event is None:
        continue
    if event.error():
        pass  # handle error
    else:
        key = event.key().decode('utf8')
        val = event.value().decode('utf8')
        print(f'Received {val} with key of {key}')
        consumer.commit(event)  # commit the offset if enable.auto.commit is set to false
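In practice the poll loop is usually wrapped so that the consumer can shut down cleanly; a sketch:

try:
    while True:
        event = consumer.poll(timeout=1.0)
        ...  # handle the event as shown above
except KeyboardInterrupt:
    pass
finally:
    consumer.close()  # commits final offsets (if enabled) and leaves the consumer group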
- A serializer converts data into bytes to be stored in a Kafka topic.
from uuid import uuid4
from confluent_kafka.serialization import SerializationContext, MessageField
from confluent_kafka.schema_registry.json_schema import JSONSerializer

serializer = JSONSerializer(schema_str, sr_client, to_dict=obj_to_dict)
key = string_serializer(str(uuid4()))
val = serializer(obj, SerializationContext(topic, MessageField.VALUE))
producer.produce(topic=topic, key=key, value=val)
- A deserializer converts bytes from Kafka topics back into usable data.
from confluent_kafka.schema_registry.json_schema import JSONDeserializer

deserializer = JSONDeserializer(schema_str, from_dict=dict_to_obj)
obj = deserializer(event.value(), SerializationContext(event.topic(), MessageField.VALUE))
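The serializer and deserializer snippets above assume that schema_str, obj_to_dict, dict_to_obj, and string_serializer already exist; a sketch of those pieces follows (the Booking class and its fields are hypothetical):

from confluent_kafka.serialization import StringSerializer

schema_str = """
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Booking",
  "type": "object",
  "properties": {
    "user_id": {"type": "string"},
    "booked": {"type": "boolean"}
  }
}
"""

class Booking:
    def __init__(self, user_id, booked):
        self.user_id = user_id
        self.booked = booked

def obj_to_dict(obj, ctx):
    # called by JSONSerializer to turn the object into a plain dict
    return {"user_id": obj.user_id, "booked": obj.booked}

def dict_to_obj(d, ctx):
    # called by JSONDeserializer to rebuild the object from a dict
    return Booking(d["user_id"], d["booked"])

string_serializer = StringSerializer("utf_8")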
- The SchemaRegistryClient provides access to the Schema Registry.
from confluent_kafka.schema_registry import SchemaRegistryClient
sr_client = SchemaRegistryClient({'url': '<Schema Registry endpoint>', 'basic.auth.user.info': '<SR_UserName>:<SR_Password>'})
Kafka Connect is an ecosystem of pluggable connectors. At the same time, it is a client application that runs outside the Kafka cluster. It abstracts a lot of connector code away from the user. A Connect cluster runs one or more workers. Source connectors act as producers; sink connectors act as consumers.
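As a sketch of how a connector is usually registered, assuming a Connect worker exposing its REST API at localhost:8083 (the connector name, file path, and topic are placeholders):

import json
import urllib.request

connector = {
    "name": "bookings-file-source",  # hypothetical connector name
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/tmp/bookings.txt",
        "topic": "bookings",
    },
}

request = urllib.request.Request(
    "http://localhost:8083/connectors",            # assumed Connect worker REST endpoint
    data=json.dumps(connector).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
urllib.request.urlopen(request)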