INFLUXDB数据压缩算法

InfluxDB版本1.4.3

InfluxDB中支持的数据类型有：

int
float
bool
string
timestamp

int类型的压缩

// Write encodes v to the underlying buffers.
func (e *IntegerEncoder) Write(v int64) {
	// Delta-encode each value as it's written.  This happens before
	// ZigZagEncoding because the deltas could be negative.
	delta := v - e.prev
	e.prev = v
	enc := ZigZagEncode(delta)
	if len(e.values) > 1 {
		e.rle = e.rle && e.values[len(e.values)-1] == enc
	}

	e.values = append(e.values, enc)
}

上述函数对写入的每个int64，以0为初始值，只记录一个delta值，并把delta值进行zigzag压缩写入buffer。

// Bytes returns a copy of the underlying buffer.
func (e *IntegerEncoder) Bytes() ([]byte, error) {
	// Only run-length encode if it could reduce storage size.
	if e.rle && len(e.values) > 2 {
		return e.encodeRLE()
	}

	for _, v := range e.values {
		// Value is too large to encode using packed format
		if v > simple8b.MaxValue {
			return e.encodeUncompressed()
		}
	}

	return e.encodePacked()
}

可以看到对该函数对buffer的delta值优先使用rle压缩，其次是packed压缩，最后实在没办法了不用压缩。

所以为了使InfluxDB对int值有最好的压缩效果，需要压缩的int序列必须是一个等差数列或者全部一样，例如

1	9999,10000,10001,10002,10003,...

或者

1	123,123,123,123,123

但是在实际业务中，这种值并不一定常见。

bool的压缩
直接使用一个bit来压缩每个bool值, 实测１亿的bool值压缩后是12.3MB，而１亿个bit总的大小为11.9MB，多出来的0.3MB是额外的元信息
float的压缩
使用的是facebook的Gorilla的float压缩算法
string压缩
使用的是snappy压缩算法

timestamp压缩
使用和int压缩差不多的算法

// Bytes returns the encoded bytes of all written times.
func (e *encoder) Bytes() ([]byte, error) {
	if len(e.ts) == 0 {
		return e.bytes[:0], nil
	}

	// Maximum and largest common divisor.  rle is true if dts (the delta timestamps),
	// are all the same.
	max, div, rle, dts := e.reduce()

	// The deltas are all the same, so we can run-length encode them
	if rle && len(e.ts) > 1 {
		return e.encodeRLE(e.ts[0], e.ts[1], div, len(e.ts))
	}

	// We can't compress this time-range, the deltas exceed 1 << 60
	if max > simple8b.MaxValue {
		return e.encodeRaw()
	}

	return e.encodePacked(div, dts)
}

改函数首先reduce了一下整个数组，找出最大值，最大共同除数div，以及是否可以rle等信息，然后按照首先rle，其次encode pack，最后encode raw的优先级对时间戳进行压缩。
同int压缩算法一样，timestamp在全部一样或者是等差数列的时候压缩比最高，同样的个数的timestamp，比如１亿，如果是等差数列，那么压缩后只有1.14MB，如果是随机的timestamp，压缩后是382MB

附录：相关的压缩算法

zigzag
数字４的6十六进制表示为0x00000004（６４位）
改数字表示中，大部分的数字为０，只有后面的几位才是真正记录信息的位置，该压缩算法是忽略签名的０来实现压缩