Time Series Data

April 22, 2023

(Updated on April 24, 2023)


Preface

Recently, I have been working with a lot of time-series data to build critical business metrics dashboards at work. We primarily use MixPanel and Google BigQuery to capture events and store metrics data. The setup is quite simple: scripts run every few minutes to query the needed data points from MixPanel and BigQuery and save them in a friendly format in a self-hosted PostgreSQL instance. This PostgreSQL instance is connected to Grafana as a data source, and multiple dashboards and alerts are built on top of it using SQL.
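
As a rough illustration, here is a minimal sketch of what one of these periodic scripts could look like, assuming psycopg2 on the PostgreSQL side; the table name, column names and the fetch_metric() helper are hypothetical stand-ins for the real MixPanel / BigQuery queries.

```python
# Minimal sketch of a periodic ingestion script: fetch one data point and
# store it in PostgreSQL. Table, columns and fetch_metric() are hypothetical.
from datetime import datetime, timezone

import psycopg2


def fetch_metric() -> float:
    # Placeholder for a MixPanel / BigQuery query returning a single value.
    return 42.0


conn = psycopg2.connect("dbname=metrics user=grafana")
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO business_metrics (metric_name, recorded_at, value) VALUES (%s, %s, %s)",
        ("daily_active_users", datetime.now(timezone.utc), fetch_metric()),
    )
conn.close()
```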

So far, this system has worked great and served our scale appropriately, but as the number of dashboards and the volume of data points has grown tremendously over time, the underlying PostgreSQL instance has started to show its age.

I scaled up the PostgreSQL database multiple times to keep the system going, but I realised that this was just kicking the can down the road and that it was time to move to a dedicated, high-performing time-series database.

What is Time-Series data?

In very simple words, time-series data is a collection of data points collected over a period of time. It is typically a sequence of metric values ordered by increasing time, which enables us to track how a metric changes over time.

These data points can be anything that varies over time, for example weather information, IoT and sensor data, financial stock information, system metrics such as CPU / memory utilisation, or medical data such as heart rate and pulse.
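
For instance, CPU utilisation sampled once a minute is a tiny time series: just timestamped values in increasing time order (the values below are made up purely for illustration).

```python
# CPU utilisation sampled once a minute, stored as (timestamp, value) pairs
# in increasing time order; the values are made up for illustration.
cpu_usage = [
    ("2023-04-22T10:00:00Z", 41.5),
    ("2023-04-22T10:01:00Z", 43.2),
    ("2023-04-22T10:02:00Z", 39.8),
]
```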

In today’s constantly evolving world, it has become critical for businesses to track changes in their key metrics and use them to analyse trends, anomalies and growth, and to make business-critical decisions. Many database solutions have emerged that are designed specifically for collecting and querying this kind of data.

Essential requirements of a time-series database are:

  • High-speed writes, as data events arrive at a very high velocity.
  • Good storage compression, as the volume of data increases dramatically over time.
  • Efficient interval querying and analysis, as data is often fetched over time intervals for visualisation and analysis.

Time Series data in PostgreSQL

PostgreSQL is a highly performant and battle-tested database management system that can certainly be optimised for storing and querying time-series data. However, it lacks native support for time-series-specific features such as time bucketing. Since PostgreSQL is a traditional relational database, it is also harder to map a natural time-series data model onto its relational schema.

Some common practices for optimising PostgreSQL for time-series data involve:

  • Adopting partitioned tables - partitioning improves query performance by dividing large tables into smaller, more manageable partitions based on a time period (see the sketch after this list).
  • Adding horizontal scaling support to handle large volumes of data and increased load as time-series data grows over time.
  • Achieving high query performance and better compression of data by converting tables to columnar storage. Citus is a popular solution for achieving this and adding horizontal scaling support as well!
  • One can also explore the TimescaleDB extension for efficient time-series support in PostgreSQL. TimescaleDB is an open-source extension that extends PostgreSQL with time-series features, making it easier to store, analyse, and visualise time-series data.
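
As a rough sketch of the first and last points, the snippet below creates a time-partitioned table with native declarative partitioning and, as an alternative, converts a plain table into a TimescaleDB hypertable. The table and column names are hypothetical, and the second option assumes the TimescaleDB extension is installed on the server.

```python
# Two alternative approaches to time-series-friendly storage in PostgreSQL;
# in practice you would pick one, not both. Table and columns are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=metrics")
with conn, conn.cursor() as cur:
    # Option 1: native declarative partitioning, one partition per month.
    cur.execute("""
        CREATE TABLE metrics (
            recorded_at timestamptz NOT NULL,
            name        text        NOT NULL,
            value       double precision
        ) PARTITION BY RANGE (recorded_at)
    """)
    cur.execute("""
        CREATE TABLE metrics_2023_04 PARTITION OF metrics
            FOR VALUES FROM ('2023-04-01') TO ('2023-05-01')
    """)

    # Option 2: with the TimescaleDB extension installed, create a plain table
    # and convert it into a hypertable that is chunked by time automatically.
    cur.execute("""
        CREATE TABLE metrics_ts (
            recorded_at timestamptz NOT NULL,
            name        text        NOT NULL,
            value       double precision
        )
    """)
    cur.execute("SELECT create_hypertable('metrics_ts', 'recorded_at')")
conn.close()
```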

While it is possible to optimise PostgreSQL to store time-series data, it requires much more manual work to get started and to maintain over time. I wanted a lean system that doesn’t require much maintenance, so I decided to explore dedicated databases. That’s when I found out about InfluxDB, a purpose-built time-series database.

What is InfluxDB?

InfluxDB is a high-performance, distributed time-series database written in the Go programming language. It is designed to ingest time-stamped data at very high velocity and offers essential time-series-specific features for fast and efficient querying. It is part of the original open-source TICK stack (Telegraf, InfluxDB, Chronograf, Kapacitor) designed and developed for high-performance time-series systems.

It also provides functionality to conserve storage space by keeping data only for a defined length of time, automatically downsampling or expiring and deleting data that is no longer needed.
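
For example, retention can be configured when creating a bucket (buckets are described in the next section). Here is a minimal sketch using the official influxdb-client Python library, where the URL, token, org and bucket name are placeholders for illustration.

```python
# Create a bucket whose data automatically expires after 30 days.
# Connection details and names are placeholders.
from influxdb_client import InfluxDBClient, BucketRetentionRules

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")

retention = BucketRetentionRules(type="expire", every_seconds=30 * 24 * 3600)
client.buckets_api().create_bucket(
    bucket_name="business-metrics",
    retention_rules=retention,
    org="my-org",
)
client.close()
```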

Time Series data in InfluxDB

InfluxDB is designed for high write throughput and query performance. It also supports horizontal scaling by introducing additional nodes to the cluster, making it an attractive choice for processing large volumes of data.

The InfluxDB data model consists of four key components:

  • Bucket: A bucket is a storage container within the Influx database that holds logically grouped time-series data. We can create different buckets for different sets of metrics, which helps us fine-tune data access controls. Buckets have built-in retention policies that can automatically downsample or delete data after a specified time interval.
  • Measurement: A measurement is a category of time-series data, such as temperature or humidity, that shares a common set of tags and fields. Each measurement is identified by a unique name and can have multiple fields and tags.
  • Field: A field is a value associated with a measurement at a specific point in time. Each field has a data type, such as integer or float, and can be used in mathematical operations and aggregations such as mean, sum, and count.
  • Tag: A tag is a key-value pair that provides metadata about a measurement. Tags are indexed and can be used to filter, group, and aggregate data. For example, a tag could be used to specify the location of a temperature sensor or the type of device that is sending data.

InfluxDB offers its own query and processing language, Flux, which provides a rich set of functions for filtering, aggregating, joining, and manipulating data, and can be used to perform complex data analysis tasks.
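
To make the data model and Flux a little more concrete, here is a minimal sketch using the official influxdb-client Python library; the measurement, tag, field and bucket names, as well as the connection details, are assumptions made up for illustration.

```python
# Write one point (measurement + tag + field + timestamp) and read it back
# with a Flux query. All names and connection details are placeholders.
from datetime import datetime, timezone

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")

# Write: measurement "cpu", tag "host", field "usage_percent".
point = (
    Point("cpu")
    .tag("host", "server01")
    .field("usage_percent", 64.2)
    .time(datetime.now(timezone.utc))
)
client.write_api(write_options=SYNCHRONOUS).write(bucket="business-metrics", record=point)

# Query: average CPU usage per 5-minute window over the last hour.
flux = '''
from(bucket: "business-metrics")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "cpu")
  |> aggregateWindow(every: 5m, fn: mean)
'''
for table in client.query_api().query(flux):
    for record in table.records:
        print(record.get_time(), record.get_value())

client.close()
```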

InfluxDB also has a direct integration with Grafana, which makes it a very attractive choice for storing data from multiple sources and then performing complex analytics and visualisation in Grafana using the Flux query language. On top of this, InfluxDB ships with a powerful built-in visualisation system of its own.


In conclusion, time-series data is now an essential part of our daily lives, and many solutions to effectively manage and analyse it have become available. While traditional relational databases like PostgreSQL can handle time-series data to some extent, specialised time-series databases like InfluxDB and TimescaleDB are designed specifically for the unique needs of this type of data. They offer advanced out-of-the-box features and capabilities such as high-performance queries, scalability, native time-series data models, and advanced visualisation options.

For my use case, InfluxDB offers very good built-in tools for time-series analysis and prides itself on conceptual simplicity. I have started to really like InfluxDB and will be writing about its features as I explore it further.


If you enjoyed reading this article, check out another one!