...

  1. Messages to be sent to Azure Blob Storage are buffered in memory.

  2. Every time the buffer reaches the maxBlock size, the block is uploaded to Azure Blob Storage asynchronously using the stageBlock API provided by Azure.

  3. A list of all blocks uploaded for a blob is maintained in memory.

  4. When the SystemProducer is flushed, for each stream-partition the remaining buffer is uploaded as a new block, and the block list is then committed using the commitBlockList API provided by Azure. This creates the Block Blob from the list of uploaded blocks.

  5. Committing the block list has to be a blocking call that waits for all pending block uploads of the blob to finish before the commit, as Azure Blob Storage expects this.

  6. Messages sent through the SystemProducer after a flush become part of a new blob. Hence, a timestamp is added to the blob name, as there could be multiple blobs (one per flush of the SystemProducer) for the same SSP. This timestamp corresponds to the time when the first message is received by the SystemProducer after a commit.

  7. Optionally, a random string can be appended to the blob name to avoid name collisions when two tasks write to the same SSP at the same time.
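
The steps above can be sketched roughly as follows. This is a minimal illustration and not the actual SystemProducer implementation: `BlockStore`, `InMemoryBlockStore`, and `BlockBlobWriter` are hypothetical names, and `BlockStore` is an in-memory stand-in for the stageBlock and commitBlockList calls so that the sketch runs without the Azure SDK or an Azure account. The blob name format (prefix, timestamp, random suffix) is likewise an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the two Azure calls named above; this is NOT the Azure SDK. */
interface BlockStore {
    CompletableFuture<Void> stageBlock(String blobName, String blockId, byte[] data);
    void commitBlockList(String blobName, List<String> blockIds);
}

/** In-memory fake, used only so the sketch is runnable. */
class InMemoryBlockStore implements BlockStore {
    final Map<String, byte[]> staged = new ConcurrentHashMap<>();
    final Map<String, List<String>> committed = new ConcurrentHashMap<>();

    public CompletableFuture<Void> stageBlock(String blobName, String blockId, byte[] data) {
        // Asynchronous upload (step 2).
        return CompletableFuture.runAsync(() -> staged.put(blobName + "/" + blockId, data));
    }

    public void commitBlockList(String blobName, List<String> blockIds) {
        committed.put(blobName, new ArrayList<>(blockIds));
    }
}

/** Buffers messages for one stream-partition and produces one blob per flush. */
class BlockBlobWriter {
    private final BlockStore store;
    private final String blobPrefix;
    private final int maxBlockSize;
    private ByteArrayOutputStream buffer = new ByteArrayOutputStream();        // step 1
    private final List<String> blockIds = new ArrayList<>();                   // step 3
    private final List<CompletableFuture<Void>> pending = new ArrayList<>();
    private String blobName; // null until the first message after a flush

    BlockBlobWriter(BlockStore store, String blobPrefix, int maxBlockSize) {
        this.store = store;
        this.blobPrefix = blobPrefix;
        this.maxBlockSize = maxBlockSize;
    }

    synchronized void write(byte[] message) {
        if (blobName == null) {
            // Steps 6 and 7: timestamp of the first message plus a random suffix (assumed format).
            String suffix = UUID.randomUUID().toString().substring(0, 8);
            blobName = blobPrefix + "-" + System.currentTimeMillis() + "-" + suffix;
        }
        buffer.write(message, 0, message.length);
        if (buffer.size() >= maxBlockSize) {
            stageCurrentBuffer();
        }
    }

    private void stageCurrentBuffer() {
        // Azure expects Base64 block IDs of equal length within a blob; a fixed-width counter satisfies that.
        String blockId = Base64.getEncoder().encodeToString(
            String.format("%06d", blockIds.size()).getBytes(StandardCharsets.UTF_8));
        blockIds.add(blockId);
        pending.add(store.stageBlock(blobName, blockId, buffer.toByteArray()));
        buffer = new ByteArrayOutputStream();
    }

    /** Returns the name of the committed blob, or null if nothing was written since the last flush. */
    synchronized String flush() {
        if (blobName == null) {
            return null;
        }
        if (buffer.size() > 0) {
            stageCurrentBuffer(); // step 4: the remaining buffer becomes the last block
        }
        // Step 5: block until every pending upload has finished before committing.
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        store.commitBlockList(blobName, blockIds);
        String committedName = blobName;
        blobName = null; // the next message starts a new blob
        blockIds.clear();
        pending.clear();
        return committedName;
    }
}
```

In this sketch, the blocking `join()` in `flush()` is what keeps the commit from running ahead of in-flight uploads; a real producer would also need to handle upload failures and timeouts, which are omitted here.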

...