I have run into three main problems so far: two related to buffers and one related to connections. Below I describe each problem in detail along with the measures I have taken so far. First, the Td-agent architecture I am using is shown below. PS: for convenience, Td-agent is abbreviated to TD throughout; for the relationship between TD and Fluentd, see my other blog post.
Note: the buffer settings here were tested and take effect on version 0.14.21; I verified that they do NOT take effect on 0.12.20. See the [Fluentd official site] for details on version support.
```mermaid
graph LR;
A(Td-client)-->F(Td-forward)
B(Td-client)-->F(Td-forward)
F-->E(Elasticsearch cluster)
E-->K(Kibana)
```
Versions:
td-agent 0.14.21
ES: Version: 5.0.0, Build: 253032b/2016-10-26T04:37:51.531Z, JVM: 1.8.0_121, lucene_version: 6.2.0
Td-agent Elasticsearch plugin versions:
elasticsearch (1.0.18)
elasticsearch-api (1.0.18)
elasticsearch-transport (1.0.18)
fluent-plugin-elasticsearch (1.8.0)
Q1: Buffer problems on the Td-client side
This problem occurred most frequently, and the logs make the cause obvious; fixing it is mainly a matter of tuning the buffer parameters.
```
2018-03-16 03:25:19 +0000 [warn]: #0 suppressed same stacktrace
2018-03-16 03:25:19 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:throw_exception
2018-03-16 03:25:19 +0000 [warn]: #0 emit transaction failed: error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data" tag="logics.5013.205"
2018-03-16 03:25:19 +0000 [warn]: #0 suppressed same stacktrace
2018-03-16 03:25:19 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:throw_exception
2018-03-16 03:25:19 +0000 [warn]: #0 emit transaction failed: error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data" tag="logics.5073.205"
```
```
<match logics.**>
  type forward
  <buffer>
    @type file
    path /var/log/td-agent/buffer/td-gamex-buffer
    chunk_limit_size 512MB       # Default: 8MB (memory) / 256MB (file)
    total_limit_size 32GB        # Default: 512MB (memory) / 64GB (file)
    chunk_full_threshold 0.9     # flush the chunk when its actual size reaches chunk_limit_size * chunk_full_threshold
    compress text                # compression of each chunk while events are buffered
    flush_mode default
    flush_interval 15s           # Default: 60s
    flush_thread_count 1         # Default: 1; number of threads used to write chunks in parallel
    delayed_commit_timeout 60    # seconds after which an async write operation is considered failed
    overflow_action throw_exception
    retry_timeout 10m
  </buffer>
  send_timeout 60s
  recover_wait 10s
  heartbeat_interval 1s
  phi_threshold 16
  hard_timeout 60s
  heartbeat_type tcp
  <server>
    name logics.shard
    host tdagent.test.net
    port 24224
    weight 1
  </server>
</match>
```
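To get a feel for these numbers, here is a rough back-of-envelope sketch (my own illustration, assuming steady ingest and flush rates, which real traffic will not have) of how long the buffer holds out before `total_limit_size` is hit and `overflow_action :throw_exception` starts raising the `BufferOverflowError` seen in the log:

```python
# Back-of-envelope check: time until the file buffer hits total_limit_size.
def seconds_until_overflow(total_limit_bytes, ingest_bps, flush_bps):
    """Seconds until the buffer fills, or None if flushing keeps up."""
    net = ingest_bps - flush_bps  # bytes accumulating per second
    if net <= 0:
        return None  # drain rate >= ingest rate: no overflow
    return total_limit_bytes / net

GB, MB = 1024 ** 3, 1024 ** 2
# 32GB total_limit_size, ingesting 20MB/s while only 5MB/s is flushed out:
headroom = seconds_until_overflow(32 * GB, 20 * MB, 5 * MB)
print(round(headroom / 60))  # → 36 (minutes until the warnings above start)
```

The point is that a big `total_limit_size` only buys time: if the forwarder drains slower than clients emit, overflow is inevitable, and only raising flush throughput (or dropping/blocking via `overflow_action`) actually resolves it.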
Q2: Buffer problems on the Td-forward side
Normally, configuring the buffer on the forward side the same way as on the client side should be enough, but in practice the client-style configuration raises an error (below). The error points to a path problem: under the forest + copy output configuration, the buffer path must be kept separate per index, and using ${tag} in the buffer storage path solves the problem nicely. Similar issues can be found on [Github].
```
2018-04-11 02:24:29 +0000 [error]: #0 Cannot output messages with tag 'logics.5022.205'
2018-04-11 02:24:29 +0000 [error]: #0 failed to configure sub output copy: Other 'elasticsearch' plugin already use same buffer path: type = elasticsearch, buffer path = /var/log/td-agent/buffer/td-gamex-buffer
2018-04-11 02:24:29 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/buf_file.rb:71:in `configure'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/output.rb:305:in `configure'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin_helper/inject.rb:104:in `configure'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin_helper/event_emitter.rb:73:in `configure'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/compat/output.rb:504:in `configure'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-elasticsearch-1.9.2/lib/fluent/plugin/out_elasticsearch.rb:71:in `configure'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin.rb:164:in `configure'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/multi_output.rb:73:in `block in configure'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/multi_output.rb:62:in `each'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/multi_output.rb:62:in `configure'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin.rb:164:in `configure'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-forest-0.3.3/lib/fluent/plugin/out_forest.rb:132:in `block in plant'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-forest-0.3.3/lib/fluent/plugin/out_forest.rb:128:in `synchronize'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-forest-0.3.3/lib/fluent/plugin/out_forest.rb:128:in `plant'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-forest-0.3.3/lib/fluent/plugin/out_forest.rb:169:in `emit'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/compat/output.rb:211:in `process'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/bare_output.rb:53:in `emit_sync'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/event_router.rb:96:in `emit_stream'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/in_forward.rb:300:in `on_message'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/in_forward.rb:211:in `block in handle_connection'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/in_forward.rb:248:in `call'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/in_forward.rb:248:in `block (3 levels) in read_messages'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/in_forward.rb:247:in `feed_each'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/in_forward.rb:247:in `block (2 levels) in read_messages'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/in_forward.rb:256:in `call'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/in_forward.rb:256:in `block in read_messages'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin_helper/server.rb:576:in `call'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin_helper/server.rb:576:in `on_read_without_connection'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.4.6/lib/cool.io/io.rb:123:in `on_readable'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.4.6/lib/cool.io/io.rb:186:in `on_readable'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.4.6/lib/cool.io/loop.rb:88:in `run_once'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.4.6/lib/cool.io/loop.rb:88:in `run'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin_helper/event_loop.rb:84:in `block in start'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
```
```
<match logics.**>
  type forest
  subtype copy
  <template>
    <store>
      @type elasticsearch
      <buffer>
        @type file
        path /var/log/td-agent/buffer/td-gamex-buffer/${tag}
        chunk_limit_size 512MB       # Default: 8MB (memory) / 256MB (file)
        total_limit_size 32GB        # Default: 512MB (memory) / 64GB (file)
        chunk_full_threshold 0.9     # flush the chunk when its actual size reaches chunk_limit_size * chunk_full_threshold
        compress text                # compression of each chunk while events are buffered
        flush_mode default
        flush_interval 15s           # Default: 60s
        flush_thread_count 1         # Default: 1; number of threads used to write chunks in parallel
        delayed_commit_timeout 60    # seconds after which an async write operation is considered failed
        overflow_action throw_exception
        retry_timeout 10m
      </buffer>
      host elasticsearch.test.net
      port 9200
      logstash_format true
      logstash_prefix bilogs
      logstash_dateformat logics-${tag_parts[-1]}.%Y.%m.%d
      time_key time
      request_timeout 60s
      reload_connections false
      reload_on_failure true
      reconnect_on_error true
    </store>
  </template>
</match>
```
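The effect of ${tag} in the buffer path can be sketched like this (a simplified illustration of my own, not the real plugin code): forest plants one elasticsearch output per tag, and each planted output needs its own buffer directory on disk, which the placeholder provides.

```python
# Sketch of why ${tag} in the buffer path resolves the
# "already use same buffer path" error seen in the log above.
def buffer_path(template, tag):
    """Expand the ${tag} placeholder the way the buffer path config does."""
    return template.replace("${tag}", tag)

template = "/var/log/td-agent/buffer/td-gamex-buffer/${tag}"
tags = ["logics.5013.205", "logics.5022.205", "logics.5073.205"]
paths = [buffer_path(template, t) for t in tags]
# With ${tag}, every tag maps to a distinct path, so no two outputs collide:
assert len(set(paths)) == len(tags)
print(paths[0])  # /var/log/td-agent/buffer/td-gamex-buffer/logics.5013.205
```

Without the placeholder, every planted output would expand to the same fixed path and the second one to configure would fail, which is exactly the error above.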
Q3: Connection problems on the Td-forward side
This problem occurs while TD is sending data to ES. At first I suspected the ES cluster had hit its processing limit and could not allocate more connections to TD, but everything returned to normal after a reload, so that explanation is unlikely. More plausibly, TD or ES has a flaw in its connection-handling logic and is not closing or reusing connections correctly. Digging around turned up some clues; the references are included as well.
A. [Github issue on this problem]
```
2018-03-21 04:28:19 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-03-21 04:28:34 +0000 error_class="Elasticsearch::Transport::Transport::Error" error="Cannot get new connection from pool." plugin_id="object:3fe6fced399c"
2018-03-21 04:28:19 +0000 [warn]: suppressed same stacktrace
2018-03-21 04:28:35 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-03-21 04:29:08 +0000 error_class="Elasticsearch::Transport::Transport::Error" error="Cannot get new connection from pool." plugin_id="object:3fe6fced399c"
```
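The next_retry timestamps in this log spread further apart because Fluentd retries failed flushes with exponential backoff. A minimal sketch of that schedule (assuming the default base wait of 1s and a doubling factor of 2):

```python
def retry_waits(base_wait=1.0, backoff_base=2.0, attempts=6, max_interval=None):
    """Wait time before each retry attempt, doubling every time."""
    waits = []
    for n in range(attempts):
        w = base_wait * (backoff_base ** n)
        if max_interval is not None:
            w = min(w, max_interval)  # cap the wait, like retry_max_interval
        waits.append(w)
    return waits

print(retry_waits())  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
```

This is why `retry_timeout 10m` in the buffer section matters: once the accumulated retries exceed it, the chunk is abandoned instead of being retried forever.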
The current fix mainly involves the following parameters.

reload_connections false # defaults to true
You can tune how the elasticsearch-transport host reloading feature works. By default it will reload the host list from the server on every 10,000th request to spread the load. This can be an issue if your Elasticsearch cluster is behind a reverse proxy, as the Fluentd process may not have direct network access to the Elasticsearch nodes.
In my case the ES cluster is not behind a proxy; it is reached via a DNS name.

reload_on_failure true # defaults to false
Indicates that elasticsearch-transport will try to reload the node addresses if a failure occurs while making a request. This is useful for quickly removing a dead node from the address list.
In short, when a request fails, ES-transport reloads the node addresses and drops dead nodes. I also set this to true.

reconnect_on_error true
Github suggests this helps, but in my testing it does not really work: the problem still occurs, though seemingly less often.
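The host-reloading behaviour described above can be modeled with a toy sketch (heavily simplified, my own illustration, not the real elasticsearch-transport code): with reloading enabled, the client re-sniffs the cluster's node list every 10,000 requests; disabling it keeps the configured host, which is what you want when nodes are only reachable through a single DNS name.

```python
# Toy model of elasticsearch-transport's periodic connection reloading.
class Transport:
    RELOAD_AFTER = 10_000  # default: re-sniff nodes every 10,000th request

    def __init__(self, reload_connections=True):
        self.reload_connections = reload_connections
        self.requests = 0
        self.reloads = 0

    def perform_request(self):
        self.requests += 1
        if self.reload_connections and self.requests % self.RELOAD_AFTER == 0:
            self.reloads += 1  # would re-fetch node addresses from the cluster

t = Transport(reload_connections=False)
for _ in range(25_000):
    t.perform_request()
print(t.reloads)  # → 0: the host list stays exactly as configured
```

With `reload_connections true` the same 25,000 requests would trigger two reloads, and each reload can swap the configured DNS name for node addresses Fluentd may not be able to reach.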
```
<match logics.**>
  type forest
  subtype copy
  <template>
    <store>
      type elasticsearch
      <buffer>
        @type file
        path /var/log/td-agent/buffer/td-gamex-buffer/${tag}
        chunk_limit_size 512MB       # Default: 8MB (memory) / 256MB (file)
        total_limit_size 32GB        # Default: 512MB (memory) / 64GB (file)
        chunk_full_threshold 0.9     # flush the chunk when its size reaches chunk_limit_size * chunk_full_threshold
        compress text                # compression of each chunk while events are buffered
        flush_mode default
        flush_interval 15s           # Default: 60s
        flush_thread_count 1         # Default: 1; number of threads used to write chunks in parallel
        delayed_commit_timeout 60    # seconds after which an async write operation is considered failed
        overflow_action throw_exception
        retry_timeout 10m
      </buffer>
      host elasticsearch.yingxiong.net
      port 9200
      logstash_format true
      logstash_prefix bilogs
      logstash_dateformat logics-${tag_parts[-1]}.%Y.%W
      time_key time
      flush_interval 10s
      request_timeout 15s
      num_threads 2
      reload_connections false
      reload_on_failure true
      reconnect_on_error true
    </store>
  </template>
</match>
```