Time Series Data
April 22, 2023
(Updated on April 24, 2023)
April 22, 2023
(Updated on April 24, 2023)
Recently, I have been working with a lot of time-series data to create critical business metrics dashboards at work. We primarily use MixPanel and Google BigQuery to capture events and store metrics data. The setup to build metrics dashboards is quite simple, i.e. there are scripts that run every few minutes to query the needed data points from MixPanel and BigQuery and save them in a self hosted PostgreSQL instance in a friendly format. This PostgreSQL instance is then connected to Grafana as a datasource and multiple dashboards and alerts are built on top of it using SQL.
So far, this system was working great and served our scale appropriately, but as the number of dashboards and the quantity of data points has been tremendously growing overtime, the underlying PostgreSQL has started to show its age.
I scaled up the PostgreSQL database multiple times to improve the system, however I realised that this is like kicking the can down the road and it is time to move to a dedicated and high performing time-series database.
In very simple words, time-series data is a collection of data points over a period of time. It is often a sequence of metric points in the increasing order of time which enables us to track changes in the metric over time.
These data points can be anything that vary over time, for example, weather information, IoT and sensor data, financial stock information, system metrics such as CPU / memory utilisation, or medical data such as heart rate, pulse etc.
In today’s constantly evolving world, it has become extremely critical for businesses to track change in their crucial metrics and use it to analyse trends, anomalies, growth and take business critical decisions. Many database solutions are coming up that are designed to handle collecting such kind of data.
Essential requirements of a time-series database are:
PostgreSQL is a highly performant and battle tested database management system that can definitely be optimised for storing and querying time-series database. However, it lacks native support for time-series-specific data types such as time interval or time bucketing. Since PostgreSQL is a tradition relational database, it is also difficult to leverage the evolved data model of a time-series object.
Some of the common practices for optimising PostgreSQL for time-series database would involve:
One can also explore the TimescaleDB extension to support efficient time-series data in PostgreSQL. TimescaleDB is an open-source extention that extends PostgreSQL with time-series features, making it easier to store, analyse, and visualise time-series data.
While it is possible to optimise and store time-series data in PostgreSQL, it requires much more manual work to get started and maintain over time. I wanted a lean system that doesn’t require much maintenance to speed up my workflow and decided to explore dedicated databases. That’s when I found out about a purpose-built time-series database InfluxDB.
InfluxDB is a high performing and distributed time-series database written in the Go programming language. It is specially designed to handle huge velocity of time stamped data and offers essential time-series specific features for fast and efficient data query. It is part of the original TICK open source stack designed and developed for fast performance time-series systems.
It also provides functionality that allows us to conserve storage space by keeping data for a defined length of time, by automatically downsampling or expiring and deleting unneeded data from the system.
InfluxDB is designed for high write throughput and query performance. It also supports horizontal scaling making it a much attractive choice for large volumes of data processing by introducing additional nodes to the cluster.
The InfluxDB data model consists of four key components.
InfluxDB offers its own querying and processing language - Flux, which provides a rich set of functions for filtering, aggregating, joining, and manipulating data, and can be used to perform complex data analysis tasks.
InfluxDB also has direct integration with Grafana which makes it a very attractive choice to store data from multiple sources and then perform complex analytics and data visualisation from Grafana using the Flux query language. On top of this, it has a built in powerful visualisation system.
In conclusion, time-series database is now an essential part of our daily life and many solutions to effectively manage and analyse time-series data have become available. While traditional relational databases like PostgreSQL can handle time series data to some extent, specialised time series databases like InfluxDB and TimescaleDB are designed specifically to handle the unique needs of this type of data. These databases offer advanced out of the box features and capabilities such as high-performance queries, scalability, native time series data models, and advanced visualisation options.
For my use case, InfluxDB offers very good built in tools for time-series analysis and prides itself on conceptual simplicity. I have started to really like InfluxDB and will be writing about its features as I explore it further.
If you enjoyed reading this article, checkout another one!