There are two ways to solve any problem: accurately or approximately. Accurate data structures have their disadvantages – they use too much memory and do not scale to the real-time nature of the data. In this talk I explained how to take advantage of the newly released Redis 4.0 with pluggable modules to build a data pipeline that uses probabilistic data structures to get real-time insights.
There are many insights and metrics that can be obtained from log event data. Processing the data in real time and getting accurate results is possible in theory. In practice, it is not so easy.
Not all results and metrics need to be accurate. There are places where the trade-off between accuracy and memory usage/scalability is worth it. That is where probabilistic data structures (PDS) come in. In this talk I explained the different PDSs and how they work, and I also talked about how to use Redis and its pluggable module system to work with these data structures much more efficiently.
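To make the pipeline idea concrete, here is a minimal sketch in Python with redis-py, assuming a local Redis 4.0 server started with a probabilistic-data-structure module such as ReBloom loaded (e.g. `redis-server --loadmodule /path/to/rebloom.so`); the key names and sample log records are hypothetical:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Hypothetical stream of parsed log events.
events = [
    {"ip": "10.0.0.1", "path": "/home"},
    {"ip": "10.0.0.2", "path": "/login"},
    {"ip": "10.0.0.1", "path": "/home"},
]

for event in events:
    # Bloom filter (module command): have we ever seen this IP?
    first_visit = r.execute_command("BF.ADD", "seen_ips", event["ip"])
    # HyperLogLog (built into Redis core): count unique visitors.
    r.pfadd("unique_ips", event["ip"])

print(r.pfcount("unique_ips"))  # approximate number of distinct IPs
```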
- Introduction
- Problem: Parsing high-volume, high-velocity log event data.
- Various metrics to be measured.
- Redis 4.0 and using the new module system.
- Difference between accurate data structures and probabilistic data structures.
- Top-K – Getting the top K items from a dataset (sketch below).
- Bloom Filters – Checking for set membership (sketch below).
- Count-Min Sketch – Estimating item counts (sketch below).
- HyperLogLog – Estimating the cardinality of sets (sketch below).
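Top-K keeps only the heaviest hitters in a bounded amount of memory. A sketch against Redis, assuming a module that exposes RedisBloom's TOPK.* commands (the key name and sample data are hypothetical):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Reserve a Top-K sketch tracking the 3 heaviest items,
# backed by a width-8, depth-7 sketch with 0.9 decay.
r.execute_command("TOPK.RESERVE", "top_pages", 3, 8, 7, 0.9)

for page in ["/home", "/login", "/home", "/api", "/home", "/login"]:
    r.execute_command("TOPK.ADD", "top_pages", page)

# Approximate top-3 pages by frequency.
print(r.execute_command("TOPK.LIST", "top_pages"))
```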
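Bloom filters answer "have I seen this before?" with no false negatives and a tunable false-positive rate. A sketch using RedisBloom's BF.* commands (the key name is hypothetical):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Size the filter for ~1M items at a 1% false-positive rate.
r.execute_command("BF.RESERVE", "seen_ips", 0.01, 1_000_000)

r.execute_command("BF.ADD", "seen_ips", "10.0.0.1")

print(r.execute_command("BF.EXISTS", "seen_ips", "10.0.0.1"))  # 1
print(r.execute_command("BF.EXISTS", "seen_ips", "10.0.0.9"))  # 0, almost surely
```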
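To show how a Count-Min Sketch works internally, here is a minimal pure-Python version (the width/depth values are illustrative; a Redis module can expose the same structure as server-side commands):

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min Sketch: d rows of w counters. Each item hashes
    into one counter per row; the estimate is the minimum across rows,
    so counts may be overestimated but never underestimated."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash per row by salting the item with the row number.
        h = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for page in ["/home", "/home", "/login"]:
    cms.add(page)
print(cms.estimate("/home"))  # >= 2, and usually exactly 2
```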
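HyperLogLog needs no module at all: Redis has shipped it in core since 2.8.9, with each key capped at 12 KB and a standard error of about 0.81%. A short sketch (the key name is hypothetical):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Count distinct visitor IPs without storing the IPs themselves.
for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"]:
    r.pfadd("unique_ips", ip)

print(r.pfcount("unique_ips"))  # approximately 3
```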