I have run into three main problems so far: two related to buffers and one related to connections. Below I describe each problem in detail along with the measures I have taken so far. First, the Td-agent architecture I am using is shown below. PS: for convenience, Td-agent is abbreviated to TD throughout; for the relationship between TD and Fluentd, see my other blog post.
Note: the buffer settings here were tested and take effect on version 0.14.21; I verified that they do NOT take effect on 0.12.20. See the [Fluentd official site] for details on version support.
```mermaid
graph LR;
A(Td-client)-->F(Td-forward)
B(Td-client)-->F(Td-forward)
F-->E(Elasticsearch cluster)
E-->K(Kibana)
```
Versions:
td-agent 0.14.21
ES: Version: 5.0.0, Build: 253032b/2016-10-26T04:37:51.531Z, JVM: 1.8.0_121, lucene_version: 6.2.0
Td-agent Elasticsearch plugin versions:
elasticsearch (1.0.18)
elasticsearch-api (1.0.18)
elasticsearch-transport (1.0.18)
fluent-plugin-elasticsearch (1.8.0)
Q1: Buffer problems on the Td-client side
This problem occurred most frequently, and the logs make the cause obvious; fixing it is mainly a matter of tuning the buffer parameters.
```
2018-03-16 03:25:19 +0000 [warn]: #0 suppressed same stacktrace
2018-03-16 03:25:19 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:throw_exception
2018-03-16 03:25:19 +0000 [warn]: #0 emit transaction failed: error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data" tag="logics.5013.205"
2018-03-16 03:25:19 +0000 [warn]: #0 suppressed same stacktrace
2018-03-16 03:25:19 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:throw_exception
2018-03-16 03:25:19 +0000 [warn]: #0 emit transaction failed: error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data" tag="logics.5073.205"
```
```
<match logics.**>
  type forward
  <buffer>
    @type file
    path /var/log/td-agent/buffer/td-gamex-buffer
    chunk_limit_size 512MB       # Default: 8MB (memory) / 256MB (file)
    total_limit_size 32GB        # Default: 512MB (memory) / 64GB (file)
    chunk_full_threshold 0.9     # flush the chunk when its actual size reaches chunk_limit_size * chunk_full_threshold
    compress text                # compression of each chunk while events are buffered
    flush_mode default
    flush_interval 15s           # Default: 60s
    flush_thread_count 1         # Default: 1; number of threads used to write chunks in parallel
    delayed_commit_timeout 60    # seconds after which an async write operation is considered failed
    overflow_action throw_exception
    retry_timeout 10m
  </buffer>
  send_timeout 60s
  recover_wait 10s
  heartbeat_interval 1s
  phi_threshold 16
  hard_timeout 60s
  heartbeat_type tcp
  <server>
    name logics.shard
    host tdagent.test.net
    port 24224
    weight 1
  </server>
</match>
```
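To get a feel for these numbers, here is a rough back-of-envelope sketch (my own illustration, assuming steady ingest and flush rates, which real traffic will not have) of how long the buffer holds out before `total_limit_size` is hit and `overflow_action :throw_exception` starts raising the `BufferOverflowError` seen in the log:

```python
# Back-of-envelope check: time until the file buffer hits total_limit_size.
def seconds_until_overflow(total_limit_bytes, ingest_bps, flush_bps):
    """Seconds until the buffer fills, or None if flushing keeps up."""
    net = ingest_bps - flush_bps  # bytes accumulating per second
    if net <= 0:
        return None  # drain rate >= ingest rate: no overflow
    return total_limit_bytes / net

GB, MB = 1024 ** 3, 1024 ** 2
# 32GB total_limit_size, ingesting 20MB/s while only 5MB/s is flushed out:
headroom = seconds_until_overflow(32 * GB, 20 * MB, 5 * MB)
print(round(headroom / 60))  # → 36 (minutes until the warnings above start)
```

The point is that a big `total_limit_size` only buys time: if the forwarder drains slower than clients emit, overflow is inevitable, and only raising flush throughput (or dropping/blocking via `overflow_action`) actually resolves it.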
Q2: Buffer problems on the Td-forward side
Normally, configuring the buffer on the forward side the same way as on the client side should be enough, but in practice the client-style configuration raises an error (below). The error points to a path problem: under the forest + copy output configuration, the buffer path must be kept separate per index, and using ${tag} in the buffer storage path solves the problem nicely. Similar issues can be found on [Github].
```
2018-04-11 02:24:29 +0000 [error]: #0 Cannot output messages with tag 'logics.5022.205'
2018-04-11 02:24:29 +0000 [error]: #0 failed to configure sub output copy: Other 'elasticsearch' plugin already use same buffer path: type = elasticsearch, buffer path = /var/log/td-agent/buffer/td-gamex-buffer
2018-04-11 02:24:29 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/buf_file.rb:71:in `configure'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/output.rb:305:in `configure'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin_helper/inject.rb:104:in `configure'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin_helper/event_emitter.rb:73:in `configure'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/compat/output.rb:504:in `configure'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-elasticsearch-1.9.2/lib/fluent/plugin/out_elasticsearch.rb:71:in `configure'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin.rb:164:in `configure'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/multi_output.rb:73:in `block in configure'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/multi_output.rb:62:in `each'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/multi_output.rb:62:in `configure'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin.rb:164:in `configure'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-forest-0.3.3/lib/fluent/plugin/out_forest.rb:132:in `block in plant'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-forest-0.3.3/lib/fluent/plugin/out_forest.rb:128:in `synchronize'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-forest-0.3.3/lib/fluent/plugin/out_forest.rb:128:in `plant'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-forest-0.3.3/lib/fluent/plugin/out_forest.rb:169:in `emit'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/compat/output.rb:211:in `process'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/bare_output.rb:53:in `emit_sync'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/event_router.rb:96:in `emit_stream'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/in_forward.rb:300:in `on_message'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/in_forward.rb:211:in `block in handle_connection'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/in_forward.rb:248:in `call'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/in_forward.rb:248:in `block (3 levels) in read_messages'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/in_forward.rb:247:in `feed_each'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/in_forward.rb:247:in `block (2 levels) in read_messages'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/in_forward.rb:256:in `call'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/in_forward.rb:256:in `block in read_messages'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin_helper/server.rb:576:in `call'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin_helper/server.rb:576:in `on_read_without_connection'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.4.6/lib/cool.io/io.rb:123:in `on_readable'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.4.6/lib/cool.io/io.rb:186:in `on_readable'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.4.6/lib/cool.io/loop.rb:88:in `run_once'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.4.6/lib/cool.io/loop.rb:88:in `run'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin_helper/event_loop.rb:84:in `block in start'
/opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
```
```
<match logics.**>
  type forest
  subtype copy
  <template>
    <store>
      @type elasticsearch
      <buffer>
        @type file
        path /var/log/td-agent/buffer/td-gamex-buffer/${tag}
        chunk_limit_size 512MB       # Default: 8MB (memory) / 256MB (file)
        total_limit_size 32GB        # Default: 512MB (memory) / 64GB (file)
        chunk_full_threshold 0.9     # flush the chunk when its actual size reaches chunk_limit_size * chunk_full_threshold
        compress text                # compression of each chunk while events are buffered
        flush_mode default
        flush_interval 15s           # Default: 60s
        flush_thread_count 1         # Default: 1; number of threads used to write chunks in parallel
        delayed_commit_timeout 60    # seconds after which an async write operation is considered failed
        overflow_action throw_exception
        retry_timeout 10m
      </buffer>
      host elasticsearch.test.net
      port 9200
      logstash_format true
      logstash_prefix bilogs
      logstash_dateformat logics-${tag_parts[-1]}.%Y.%m.%d
      time_key time
      request_timeout 60s
      reload_connections false
      reload_on_failure true
      reconnect_on_error true
    </store>
  </template>
</match>
```
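The effect of ${tag} in the buffer path can be sketched like this (a simplified illustration of my own, not the real plugin code): forest plants one elasticsearch output per tag, and each planted output needs its own buffer directory on disk, which the placeholder provides.

```python
# Sketch of why ${tag} in the buffer path resolves the
# "already use same buffer path" error seen in the log above.
def buffer_path(template, tag):
    """Expand the ${tag} placeholder the way the buffer path config does."""
    return template.replace("${tag}", tag)

template = "/var/log/td-agent/buffer/td-gamex-buffer/${tag}"
tags = ["logics.5013.205", "logics.5022.205", "logics.5073.205"]
paths = [buffer_path(template, t) for t in tags]
# With ${tag}, every tag maps to a distinct path, so no two outputs collide:
assert len(set(paths)) == len(tags)
print(paths[0])  # /var/log/td-agent/buffer/td-gamex-buffer/logics.5013.205
```

Without the placeholder, every planted output would expand to the same fixed path and the second one to configure would fail, which is exactly the error above.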
Q3: Connection problems on the Td-forward side
This problem occurs while TD is sending data to ES. At first I suspected the ES cluster had hit its processing limit and could not allocate more connections to TD, but everything returned to normal after a reload, so that explanation is unlikely. More plausibly, TD or ES has a flaw in its connection-handling logic and is not closing or reusing connections correctly. Digging around turned up some clues; the references are included as well.
A. [Github issue on this problem]
```
2018-03-21 04:28:19 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-03-21 04:28:34 +0000 error_class="Elasticsearch::Transport::Transport::Error" error="Cannot get new connection from pool." plugin_id="object:3fe6fced399c"
2018-03-21 04:28:19 +0000 [warn]: suppressed same stacktrace
2018-03-21 04:28:35 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-03-21 04:29:08 +0000 error_class="Elasticsearch::Transport::Transport::Error" error="Cannot get new connection from pool." plugin_id="object:3fe6fced399c"
```
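The next_retry timestamps in this log spread further apart because Fluentd retries failed flushes with exponential backoff. A minimal sketch of that schedule (assuming the default base wait of 1s and a doubling factor of 2):

```python
def retry_waits(base_wait=1.0, backoff_base=2.0, attempts=6, max_interval=None):
    """Wait time before each retry attempt, doubling every time."""
    waits = []
    for n in range(attempts):
        w = base_wait * (backoff_base ** n)
        if max_interval is not None:
            w = min(w, max_interval)  # cap the wait, like retry_max_interval
        waits.append(w)
    return waits

print(retry_waits())  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
```

This is why `retry_timeout 10m` in the buffer section matters: once the accumulated retries exceed it, the chunk is abandoned instead of being retried forever.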
The current fix mainly involves the following parameters.

reload_connections false # defaults to true
You can tune how the elasticsearch-transport host reloading feature works. By default it will reload the host list from the server on every 10,000th request to spread the load. This can be an issue if your Elasticsearch cluster is behind a reverse proxy, as the Fluentd process may not have direct network access to the Elasticsearch nodes.
In my case the ES cluster is not behind a proxy; it is reached via a DNS name.

reload_on_failure true # defaults to false
Indicates that elasticsearch-transport will try to reload the node addresses if a failure occurs while making a request. This is useful for quickly removing a dead node from the address list.
In short, when a request fails, ES-transport reloads the node addresses and drops dead nodes. I also set this to true.

reconnect_on_error true
Github suggests this helps, but in my testing it does not really work: the problem still occurs, though seemingly less often.
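The host-reloading behaviour described above can be modeled with a toy sketch (heavily simplified, my own illustration, not the real elasticsearch-transport code): with reloading enabled, the client re-sniffs the cluster's node list every 10,000 requests; disabling it keeps the configured host, which is what you want when nodes are only reachable through a single DNS name.

```python
# Toy model of elasticsearch-transport's periodic connection reloading.
class Transport:
    RELOAD_AFTER = 10_000  # default: re-sniff nodes every 10,000th request

    def __init__(self, reload_connections=True):
        self.reload_connections = reload_connections
        self.requests = 0
        self.reloads = 0

    def perform_request(self):
        self.requests += 1
        if self.reload_connections and self.requests % self.RELOAD_AFTER == 0:
            self.reloads += 1  # would re-fetch node addresses from the cluster

t = Transport(reload_connections=False)
for _ in range(25_000):
    t.perform_request()
print(t.reloads)  # → 0: the host list stays exactly as configured
```

With `reload_connections true` the same 25,000 requests would trigger two reloads, and each reload can swap the configured DNS name for node addresses Fluentd may not be able to reach.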
```
<match logics.**>
  type forest
  subtype copy
  <template>
    <store>
      type elasticsearch
      <buffer>
        @type file
        path /var/log/td-agent/buffer/td-gamex-buffer/${tag}
        chunk_limit_size 512MB       # Default: 8MB (memory) / 256MB (file)
        total_limit_size 32GB        # Default: 512MB (memory) / 64GB (file)
        chunk_full_threshold 0.9     # flush the chunk when its size reaches chunk_limit_size * chunk_full_threshold
        compress text                # compression of each chunk while events are buffered
        flush_mode default
        flush_interval 15s           # Default: 60s
        flush_thread_count 1         # Default: 1; number of threads used to write chunks in parallel
        delayed_commit_timeout 60    # seconds after which an async write operation is considered failed
        overflow_action throw_exception
        retry_timeout 10m
      </buffer>
      host elasticsearch.yingxiong.net
      port 9200
      logstash_format true
      logstash_prefix bilogs
      logstash_dateformat logics-${tag_parts[-1]}.%Y.%W
      time_key time
      flush_interval 10s
      request_timeout 15s
      num_threads 2
      reload_connections false
      reload_on_failure true
      reconnect_on_error true
    </store>
  </template>
</match>
```