Overview
Is Loki, the new darling of the logging world, really that great? After working through the troubleshooting notes below you may think twice.
- The GitHub community is not very responsive; most issues are never actually resolved and simply get auto-closed
- Error messages are quite vague, with no clear boundaries, so it is hard to tell which component a problem really comes from
- In a distributed deployment the number of components adds quite a bit of complexity
- With a large log volume the resource overhead is heavy, especially memory, and the resulting OOMs are hard to keep under control
- It inherits the Cortex design built around a hash ring, which turns out to be the source of many of the problems below
Troubleshooting
Note: the troubleshooting below is somewhat scattered, because many errors have no clear boundary and cannot be pinned to a single root cause. Most were eventually resolved by trial and error, but a few stubborn ones remain unsolved.
Error 1:
# kubectl logs -f -n grafana promtail-hzvcw
level=warn ts=2020-11-20T10:55:18.654550762Z caller=client.go:288 component=client host=loki:3100 msg="error sending batch, will retry" status=-1 error="Post \"http://loki:3100/loki/api/v1/push\": dial tcp: lookup loki on 100.100.2.138:53: no such host"
level=warn ts=2020-11-20T10:55:40.951543459Z caller=client.go:288 component=client host=loki:3100 msg="error sending batch, will retry" status=-1 error="Post \"http://loki:3100/loki/api/v1/push\": dial tcp: lookup loki on 100.100.2.138:53: no such host"
Solution:
The promtail DaemonSet was switched to hostNetwork, so it used the host's DNS by default and could not resolve the in-cluster service name. Set the DNS policy accordingly:
# kubectl edit ds -n grafana promtail
...
spec:
dnsPolicy: ClusterFirstWithHostNet
hostNetwork: true
Error 2:
# kubectl logs -f -n grafana promtail-hzvcw
level=warn ts=2020-11-20T11:02:38.873724616Z caller=client.go:288 component=client host=loki:3100 msg="error sending batch, will retry" status=429 error="server returned HTTP status 429 Too Many Requests (429): Ingestion rate limit exceeded (limit: 4194304 bytes/sec) while attempting to ingest '2470' lines totaling '1048456' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased"
level=warn ts=2020-11-20T11:02:39.570025453Z caller=client.go:288 component=client host=loki:3100 msg="error sending batch, will retry" status=429 error="server returned HTTP status 429 Too Many Requests (429): Ingestion rate limit exceeded (limit: 4194304 bytes/sec) while attempting to ingest '2470' lines totaling '1048456' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased"
level=warn ts=2020-11-20T11:02:41.165844784Z caller=client.go:288 component=client host=loki:3100 msg="error sending batch, will retry" status=429 error="server returned HTTP status 429 Too Many Requests (429): Ingestion rate limit exceeded (limit: 4194304 bytes/sec) while attempting to ingest '2470' lines totaling '1048456' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased"
level=warn ts=2020-11-20T11:02:44.446410234Z caller=client.go:288 component=client host=loki:3100 msg="error sending batch, will retry" status=429 error="server returned HTTP status 429 Too Many Requests (429): Ingestion rate limit exceeded (limit: 4194304 bytes/sec) while attempting to ingest '2462' lines totaling '1048494' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased"
Solution:
Reference: https://github.com/grafana/loki/issues/1923
This often happens right after the initial install, once collection starts: a large backlog of logs has accumulated but none of it has been shipped yet, so promtail effectively tries to push all of those huge log files at once.
The volume being pushed exceeds Loki's ingestion limit, hence the 429. To raise the limit, adjust the Loki configuration:
Note: do not set it too high, or the ingesters may come under too much pressure.
config:
limits_config:
ingestion_rate_strategy: local
# per-tenant ingestion rate limit in MB per second
ingestion_rate_mb: 15
# per-tenant allowed ingestion burst size in MB
ingestion_burst_size_mb: 20
Error 3:
Loki fails right at startup with:
Retention period should now be a multiple of periodic table duration
Solution:
The retention period must be a multiple of the table period. By default each table covers 168h, so the retention period should be a multiple of 168h (e.g. 168h x 4); if you shorten the table period as below, the retention period must be a multiple of that value instead:
schema_config:
configs:
- from: 2020-10-24
# how the index is updated and stored; the default period is 168h
index:
prefix: index_
period: 24h
# how chunks are updated and stored; the default period is 168h
chunks:
period: 24h
...
table_manager:
retention_deletes_enabled: true
# must be a multiple of index.period and chunks.period above
retention_period: 72h
Error 4:
Clicking Live in Grafana returns:
error: undefined
An unexpected error happened
Solution:
Check the Loki logs:
# kubectl logs -f -n grafana --tail=10 querier-loki-0
level=error ts=2020-11-27T11:02:51.783911277Z caller=http.go:217 org_id=fake traceID=26e4e30b17b6caf9 msg="Error in upgrading websocket" err="websocket: the client is not using the websocket protocol: 'upgrade' token not found in 'Connection' header"
level=error ts=2020-11-27T11:04:05.316230666Z caller=http.go:217 org_id=fake traceID=71a571b766390d4f msg="Error in upgrading websocket" err="websocket: the client is not using the websocket protocol: 'websocket' token not found in 'Upgrade' header"
# kubectl logs -f -n grafana --tail=10 frontend-loki-1
level=warn ts=2020-11-27T10:56:29.735085942Z caller=logging.go:60 traceID=2044a83a71a5274a msg="GET /loki/api/v1/tail?query=%7Bapp_kubernetes_io_managed_by%3D%22Helm%22%7D 23.923117ms, error: http: request method or response status code does not allow body ws: true; Accept-Encoding: gzip, deflate; Accept-Language: en,zh-CN;q=0.9,zh;q=0.8; Cache-Control: no-cache; Connection: Upgrade; Pragma: no-cache; Sec-Websocket-Extensions: permessage-deflate; client_max_window_bits; Sec-Websocket-Key: v+7oOwg9O8RTMZ4PrFLXVw==; Sec-Websocket-Version: 13; Upgrade: websocket; User-Agent: Grafana/7.3.2; X-Forwarded-For: 10.30.0.74, 10.41.131.193, 10.41.131.193; X-Forwarded-Server: pub-k8s-mgt-prd-05654-ecs; X-Real-Ip: 10.30.0.74; X-Server-Ip: 10.30.0.74; "
This usually happens when a reverse proxy or load balancer does not forward the WebSocket upgrade correctly. For Nginx, for example:
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
Reference: https://github.com/grafana/grafana/issues/22905
When adding Loki as a data source in Grafana, two custom headers need to be added to the data source configuration:
# Connection: Upgrade marks Upgrade as a hop-by-hop header; it is meant for the proxy
Connection: Upgrade
# Upgrade: websocket tells the program that finally handles the request that the browser wants to switch to the WebSocket protocol
# If only Upgrade: websocket is present, the proxy does not support the WebSocket upgrade and, per the standard, the request should be treated as a plain HTTP request
Upgrade: websocket
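If the data source is managed through Grafana provisioning, the same two headers can be declared there. A minimal sketch, assuming the data source points at the query-frontend service used in this setup (the name and URL are illustrative):
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://frontend-loki.grafana:3100
    jsonData:
      # header names go into jsonData ...
      httpHeaderName1: "Connection"
      httpHeaderName2: "Upgrade"
    secureJsonData:
      # ... and the corresponding values into secureJsonData
      httpHeaderValue1: "Upgrade"
      httpHeaderValue2: "websocket"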
Reference: https://github.com/grafana/loki/issues/2878
frontend:
tail_proxy_url: "http://querier-loki:3100"
Error 5:
# kubectl logs -f -n grafana --tail=10 frontend-loki-0
2020-11-27 11:09:10.401591 I | http: proxy error: unsupported protocol scheme "querier-loki"
Solution:
frontend:
#tail_proxy_url: "querier-loki:3100"
tail_proxy_url: "http://querier-loki:3100"
Error 6:
Querying with a Line limit of 20000 in Grafana returns:
max entries limit per query exceeded, limit > max_entries_limit (20000 > 5000)
Solution:
Reference: https://github.com/grafana/loki/issues/2226
Note: do not set this too high, or queries will put extra pressure on the cluster.
limits_config:
# the default is 5000
max_entries_limit_per_query: 20000
Error 7:
# kubectl logs -f -n grafana ingester-loki-0
level=error ts=2020-11-23T10:14:42.241840832Z caller=client.go:294 component=client host=loki:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry for stream '{container=\"filebeat\", controller_revision_hash=\"9b74f8b55\", filename=\"/var/log/pods/elastic-system_filebeat-pkxdh_07dbab26-9c45-4133-be31-54f359e9a733/filebeat/0.log\", job=\"elastic-system/filebeat\", k8s_app=\"filebeat\", namespace=\"elastic-system\", pod=\"filebeat-pkxdh\", pod_template_generation=\"6\", restart=\"time04\", stream=\"stderr\"}' has timestamp too old: 2020-11-12 03:24:07.638861214 +0000 UTC"
Solution:
limits_config:
# disable rejection of old samples
reject_old_samples: false
reject_old_samples_max_age: 168h
Error 8:
Loki memory keeps growing until the container is OOMKilled:
# kubectl describe pod -n grafana loki-0
...
Containers:
loki:
...
State: Running
Started: Mon, 23 Nov 2020 19:54:33 +0800
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Mon, 23 Nov 2020 19:38:12 +0800
Finished: Mon, 23 Nov 2020 19:54:32 +0800
Ready: False
Restart Count: 1
Limits:
cpu: 8
memory: 40Gi
Requests:
cpu: 1
memory: 20Gi
Solution:
After switching to the distributed deployment it turned out the ingesters were consuming most of the memory; with 10 replicas each instance could still reach 40GB and get OOMKilled. Flushing chunks more aggressively helps:
config:
ingester:
# how long a chunk that has not reached its maximum size may sit in memory without receiving updates before it is flushed; half-empty chunks are still flushed after this period as long as they see no further activity
chunk_idle_period: 3m
# how long a chunk is kept in memory after it has been flushed
chunk_retain_period: 1m
chunk_encoding: gzip
Error 9:
# kubectl logs -f -n grafana distributor-loki-0
level=error ts=2020-11-24T10:22:04.66435891Z caller=pool.go:161 msg="error removing stale clients" err="empty ring"
level=error ts=2020-11-24T10:22:19.664371484Z caller=pool.go:161 msg="error removing stale clients" err="empty ring"
Solution:
Case 1:
Reference: https://github.com/grafana/loki/issues/2155
The issue suggests a mismatch between the replication_factor and the number of ingester replicas. In my case all replicas were 1 and the ingester replication_factor was 1, but the Cassandra replication_factor was 3.
Query the keyspace information:
# kubectl exec cassandra-cassandra-dc1-dc1-rack1-0 -c cassandra -n grafana -- cqlsh -e "SELECT * FROM system_schema.keyspaces;" cassandra-cassandra-dc1-dc1-nodes -ucassandra -pcassandra
keyspace_name | durable_writes | replication
--------------------+----------------+-------------------------------------------------------------------------------------
system_auth | True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'}
system_schema | True | {'class': 'org.apache.cassandra.locator.LocalStrategy'}
system_distributed | True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '3'}
system | True | {'class': 'org.apache.cassandra.locator.LocalStrategy'}
loki | True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '3'}
system_traces | True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '2'}
Drop the keyspace:
# kubectl exec cassandra-cassandra-dc1-dc1-rack1-0 -c cassandra -n grafana -- cqlsh -e "DROP KEYSPACE loki;" cassandra-cassandra-dc1-dc1-nodes -ucassandra -pcassandra
Recreate the keyspace:
# kubectl exec cassandra-cassandra-dc1-dc1-rack1-0 -c cassandra -n grafana -- cqlsh -e "CREATE KEYSPACE loki WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };" cassandra-cassandra-dc1-dc1-nodes -ucassandra -pcassandra
In my case the problem persisted.
Case 2:
Reference: https://github.com/grafana/loki/issues/2131
Note: collectors/ is the prefix Loki uses for its ring in Consul.
# consul kv get -keys /
collectors/
# consul kv delete -recurse collectors
The same problem has been seen with an etcd-backed setup; there the equivalent has to be done in etcd to get it working:
# ETCDCTL_API=3 etcdctl get --prefix collectors/ --keys-only
collectors/ring
# ETCDCTL_API=3 etcdctl del "" --from-key=true
Case 3:
It also happens on the very first startup: the hash ring does not contain the instance yet, so the error is inevitable.
Relevant source: github.com/cortexproject/cortex/pkg/ring/client/pool.go
func (p *Pool) removeStaleClients() {
// Only if service discovery has been configured.
if p.discovery == nil {
return
}
serviceAddrs, err := p.discovery()
if err != nil {
level.Error(util.Logger).Log("msg", "error removing stale clients", "err", err)
return
}
for _, addr := range p.RegisteredAddresses() {
if util.StringsContain(serviceAddrs, addr) {
continue
}
level.Info(util.Logger).Log("msg", "removing stale client", "addr", addr)
p.RemoveClientFor(addr)
}
}
Relevant source: github.com/cortexproject/cortex/pkg/ring/ring.go
var (
// ErrEmptyRing is the error returned when trying to get an element when nothing has been added to hash.
ErrEmptyRing = errors.New("empty ring")
// ErrInstanceNotFound is the error returned when trying to get information for an instance
// not registered within the ring.
ErrInstanceNotFound = errors.New("instance not found in the ring")
)
Judging from the source, it is not a big deal: the error fires when stale clients are being removed but the ring lookup comes back empty.
In my case it was actually a configuration problem: the config files used for the distributed deployment were incomplete. Deploying with one complete configuration file resolved it.
Error 10:
# kubectl logs -f -n grafana ingester-loki-0
level=error ts=2020-11-25T12:05:41.624367027Z caller=transfer.go:200 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2020-11-25T12:05:41.731088455Z caller=transfer.go:200 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
Solution:
Reference: https://github.com/grafana/loki/issues/1159
At the moment the only way to bring them back to a healthy state is to delete the ring and redeploy.
The error appears when all ingester instances are deleted and redeployed: the old ring still exists in Consul, so the first new ingester believes it has to hand its chunks off to other ingesters, which of course it cannot find.
Simply delete the ring from Consul so the ingester treats this as a fresh environment:
# kubectl exec -it -n grafana consul-server-0 -- consul kv get -keys /
collectors/
# kubectl exec -it -n grafana consul-server-0 -- consul kv delete -recurse collectors
Success! Deleted keys with prefix: collectors
You can also delete it through the HTTP API:
# curl -XDELETE localhost:8500/v1/kv/collectors/ring
Note: the same error also shows up during the window in which an instance is killed after only auto-joining on timeout:
# kubectl get pod -n grafana
NAME READY STATUS RESTARTS AGE
ingester-loki-0 1/1 Running 3 28m
ingester-loki-1 1/1 Running 0 21m
ingester-loki-2 1/1 Running 0 20m
ingester-loki-3 1/1 Running 0 15m
ingester-loki-4 1/1 Running 1 14m
ingester-loki-5 1/1 Running 1 10m
# kubectl logs -f -n grafana ingester-loki-5 --previous
level=info ts=2020-11-25T12:22:18.844831919Z caller=loki.go:227 msg="Loki started"
level=info ts=2020-11-25T12:22:18.850195155Z caller=lifecycler.go:547 msg="instance not found in ring, adding with no tokens" ring=ingester
level=info ts=2020-11-25T12:22:18.881569904Z caller=lifecycler.go:394 msg="auto-joining cluster after timeout" ring=ingester
level=info ts=2020-11-25T12:23:16.687397103Z caller=signals.go:55 msg="=== received SIGINT/SIGTERM ===\n*** exiting"
level=info ts=2020-11-25T12:23:16.687736993Z caller=lifecycler.go:444 msg="lifecycler loop() exited gracefully" ring=ingester
level=info ts=2020-11-25T12:23:16.687775292Z caller=lifecycler.go:743 msg="changing instance state from" old_state=ACTIVE new_state=LEAVING ring=ingester
level=error ts=2020-11-25T12:23:17.095929216Z caller=transfer.go:200 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2020-11-25T12:23:17.26296551Z caller=transfer.go:200 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2020-11-25T12:23:17.57792403Z caller=transfer.go:200 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2020-11-25T12:23:18.091885973Z caller=transfer.go:200 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2020-11-25T12:23:19.43179565Z caller=transfer.go:200 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2020-11-25T12:23:22.61721327Z caller=transfer.go:200 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2020-11-25T12:23:26.915926188Z caller=transfer.go:200 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2020-11-25T12:23:30.19295829Z caller=transfer.go:200 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2020-11-25T12:23:33.96531916Z caller=transfer.go:200 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2020-11-25T12:23:38.966422889Z caller=transfer.go:200 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2020-11-25T12:23:38.9664856Z caller=lifecycler.go:788 msg="failed to transfer chunks to another instance" ring=ingester err="terminated after 10 retries"
# kubectl logs -f -n grafana ingester-loki-5
level=info ts=2020-11-25T12:24:04.174795365Z caller=loki.go:227 msg="Loki started"
level=info ts=2020-11-25T12:24:04.181074205Z caller=lifecycler.go:578 msg="existing entry found in ring" state=ACTIVE tokens=128 ring=ingester
Tuning join_after helps; neither too long nor too short works well:
ingester:
lifecycler:
# how long to wait to claim tokens and chunks from another member when it leaves; after this duration the instance auto-joins
join_after: 30s
Error 11:
The ingester cluster avalanches under high load and can no longer start:
# kubectl get pod -n grafana |egrep ingester-loki
ingester-loki-0 0/1 CrashLoopBackOff 9 84m
ingester-loki-1 0/1 CrashLoopBackOff 8 77m
ingester-loki-2 0/1 CrashLoopBackOff 8 76m
ingester-loki-3 1/1 Running 0 71m
ingester-loki-4 0/1 CrashLoopBackOff 9 70m
ingester-loki-5 0/1 CrashLoopBackOff 9 66m
ingester-loki-6 0/1 CrashLoopBackOff 7 63m
ingester-loki-7 0/1 CrashLoopBackOff 8 60m
ingester-loki-8 0/1 CrashLoopBackOff 7 59m
ingester-loki-9 0/1 Running 8 58m
# kubectl logs -f -n grafana ingester-loki-9
level=warn ts=2020-11-25T13:20:28.700549245Z caller=lifecycler.go:232 msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance ingester-loki-7 past heartbeat timeout"
level=warn ts=2020-11-25T13:20:32.375494375Z caller=lifecycler.go:232 msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance ingester-loki-6 in state LEAVING"
Solution:
Reference: https://github.com/cortexproject/cortex/issues/3040
This happens when ingester instances die abnormally, e.g. pods that are deleted and recreated: the stale instance entries remain in the ring (they appear to be keyed by pod name + IP, and while a restarted pod keeps its name, its IP changes). There does not seem to be a clean fix at the moment, so the blunt approach is to delete the ring data.
For Consul:
# kubectl exec -it -n grafana consul-server-0 -- consul kv delete -recurse collectors
For etcd:
# kubectl exec -it -n grafana etcd-0 /bin/sh
# ETCDCTL_API=3 etcdctl del "" --from-key=true
Changing the Cassandra consistency level to ONE also resolved it:
storage_config:
cassandra:
addresses: cassandra-cassandra-dc1-dc1-nodes
port: 9042
keyspace: loki
#consistency: "QUORUM"
consistency: "ONE"
Error 12:
# kubectl logs -f -n grafana ingester-loki-0
level=error ts=2020-11-24T10:39:58.544128896Z caller=lifecycler.go:788 msg="failed to transfer chunks to another instance" ring=ingester err="terminated after 10 retries"
level=info ts=2020-11-24T10:40:28.552357089Z caller=lifecycler.go:496 msg="instance removed from the KV store" ring=ingester
level=info ts=2020-11-24T10:40:28.552435782Z caller=module_service.go:90 msg="module stopped" module=ingester
level=info ts=2020-11-24T10:40:28.552470413Z caller=module_service.go:90 msg="module stopped" module=memberlist-kv
level=info ts=2020-11-24T10:40:28.555855919Z caller=module_service.go:90 msg="module stopped" module=store
level=info ts=2020-11-24T10:40:28.556037275Z caller=server_service.go:50 msg="server stopped"
level=info ts=2020-11-24T10:40:28.556059022Z caller=module_service.go:90 msg="module stopped" module=server
level=info ts=2020-11-24T10:40:28.556073269Z caller=loki.go:228 msg="Loki stopped"
Solution:
The error occurred on the ingester's first startup and disappeared after it restarted automatically; Loki's reconnect logic seems to be the weak point here.
# kubectl logs -f -n grafana ingester-loki-0 --tail=100
level=info ts=2020-11-24T10:40:31.181716616Z caller=loki.go:227 msg="Loki started"
level=info ts=2020-11-24T10:40:31.186107519Z caller=lifecycler.go:547 msg="instance not found in ring, adding with no tokens" ring=ingester
level=info ts=2020-11-24T10:40:31.210151298Z caller=lifecycler.go:394 msg="auto-joining cluster after timeout" ring=ingester
Error 13:
Adding the data source in Grafana returns: Loki: Internal Server Error. 500. too many failed ingesters
Solution:
# kubectl logs -f -n grafana frontend-loki-0
level=error ts=2020-11-26T05:22:25.719644039Z caller=retry.go:71 msg="error processing request" try=0 err="rpc error: code = Code(500) desc = too many failed ingesters\n"
level=error ts=2020-11-26T05:22:25.720963727Z caller=retry.go:71 msg="error processing request" try=1 err="rpc error: code = Code(500) desc = too many failed ingesters\n"
level=error ts=2020-11-26T05:22:25.722093452Z caller=retry.go:71 msg="error processing request" try=2 err="rpc error: code = Code(500) desc = too many failed ingesters\n"
level=error ts=2020-11-26T05:22:25.722679543Z caller=retry.go:71 msg="error processing request" try=3 err="rpc error: code = Code(500) desc = too many failed ingesters\n"
level=error ts=2020-11-26T05:22:25.723216916Z caller=retry.go:71 msg="error processing request" try=4 err="rpc error: code = Code(500) desc = too many failed ingesters\n"
level=warn ts=2020-11-26T05:22:25.723320728Z caller=logging.go:71 traceID=75ef1fe1fdb05ce msg="GET /loki/api/v1/label?start=1606367545667000000 (500) 4.884289ms Response: \"too many failed ingesters\\n\" ws: false; Accept: application/json, text/plain, */*; Accept-Encoding: gzip, deflate; Accept-Language: zh-CN; Dnt: 1; User-Agent: Grafana/7.3.2; X-Forwarded-For: 10.30.0.73, 10.41.131.198, 10.41.131.198; X-Forwarded-Server: pub-k8s-mgt-prd-05667-ecs; X-Grafana-Nocache: true; X-Grafana-Org-Id: 1; X-Real-Ip: 10.30.0.73; X-Server-Ip: 10.30.0.73; "
The frontend_worker section must be enabled on the queriers (a fuller sketch follows the snippet below):
# configure the querier workers that pull and execute queries queued by the query-frontend
frontend_worker:
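A fuller sketch of the block, reusing the frontend gRPC Service created in Error 17 below; parallelism is optional and the value here is illustrative:
frontend_worker:
  # gRPC address of the query-frontend
  frontend_address: "frontend-loki-grpc:9095"
  # how many queries this querier runs in parallel for the frontend
  parallelism: 10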
Error 14:
# kubectl logs -f -n grafana distributor-loki-0 |grep error
level=warn ts=2020-11-24T17:02:35.222118895Z caller=logging.go:71 traceID=3c236f4a1c3df592 msg="POST /loki/api/v1/push (500) 12.95762ms Response: \"rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5211272 vs. 4194304)\\n\" ws: false; Content-Length: 944036; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-24T17:02:36.143885453Z caller=logging.go:71 traceID=610c85e39bf2205e msg="POST /loki/api/v1/push (500) 29.834383ms Response: \"rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5211272 vs. 4194304)\\n\" ws: false; Content-Length: 944036; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
Solution:
server:
grpc_server_max_recv_msg_size: 8388608
grpc_server_max_send_msg_size: 8388608
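Alternatively (or in addition), the batches promtail sends can be kept below the gRPC limit by tuning its client config; a sketch with illustrative values, assuming the push URL used in this environment:
clients:
  - url: http://distributor-loki:3100/loki/api/v1/push
    # flush a batch once it reaches this many bytes (default is 1 MiB; lower it if pushes still exceed the server limit)
    batchsize: 524288
    # or after this much time has passed, whichever comes first
    batchwait: 1s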
Error 15:
# kubectl logs -f -n grafana ingester-loki-0 |egrep -v 'level=debug|level=info'
level=error ts=2020-11-25T10:47:03.021701842Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="Operation timed out - received only 0 responses."
level=error ts=2020-11-25T10:47:03.874812809Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="Operation timed out - received only 0 responses."
level=error ts=2020-11-25T10:47:04.119370803Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="Operation timed out - received only 0 responses."
level=error ts=2020-11-25T10:47:04.289613481Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="Operation timed out - received only 0 responses."
Solution:
This was hit with only a single replica of the data; the keyspace replication factor needs to be increased.
With a replication factor of 1 you depend on a single node to answer queries for a given partition. Raising the RF to around 3 makes the application far more resilient to slow or failing nodes.
# ALTER KEYSPACE loki WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 3};
Some suggest raising the following Cassandra parameters (not verified):
# vi cassandra.yaml
# How long the coordinator should wait for writes to complete
write_request_timeout_in_ms: 2000
# How long the coordinator should wait for counter writes to complete
counter_write_request_timeout_in_ms: 5000
commitlog_segment_size_in_mb: 32
The timeout can also be raised on the Loki side:
storage_config:
cassandra:
timeout: 30s
Error 16:
# kubectl logs -f -n grafana ingester-loki-0 |egrep -v 'level=debug|level=info'
level=error ts=2020-11-24T07:41:08.234749133Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="no chunk table found for time 1603287501.182"
level=error ts=2020-11-24T07:41:15.543694173Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="context deadline exceeded"
level=error ts=2020-11-24T07:41:16.785829547Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="context deadline exceeded"
Solution:
This looks like a performance problem on the ingester side; the ingesters were frequently OOMing.
During the period when the ingester cluster was running normally, the nodes hosting Cassandra showed a very high load:
# top
top - 19:34:13 up 9:30, 1 user, load average: 19.11, 19.24, 17.57
Tasks: 352 total, 4 running, 348 sleeping, 0 stopped, 0 zombie
%Cpu(s): 48.1 us, 10.9 sy, 0.0 ni, 34.6 id, 2.8 wa, 0.0 hi, 3.7 si, 0.0 st
KiB Mem : 65806668 total, 32340232 free, 17603344 used, 15863092 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 47487824 avail Mem
Context switches are high:
# vmstat 1 -w
procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu--------
r b swpd free buff cache si so bi bo in cs us sy id wa st
13 1 0 32187816 368908 15561596 0 0 713 2349 44 74 22 5 72 1 0
10 0 0 32181392 368916 15561204 0 0 46824 17544 134678 116836 46 13 41 1 0
38 0 0 32153804 368980 15559408 0 0 17008 117840 134737 86828 52 14 33 2 0
89 3 0 32139480 369004 15559560 0 0 36988 117944 139337 92005 52 16 31 1 0
19 0 0 32177728 369064 15577352 0 0 13740 105716 109264 100241 48 12 31 9 0
8 2 0 32215968 369100 15568960 0 0 29956 109324 146574 96660 50 16 32 2 0
11 0 0 32237324 369156 15562756 0 0 34932 103524 107201 71884 50 13 34 4 0
Disk saturation and I/O wait are both high:
# iostat -x 1
avg-cpu: %user %nice %system %iowait %steal %idle
45.00 0.00 14.85 3.87 0.00 36.28
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 33.00 0.00 12.00 0.00 228.00 38.00 0.01 7.58 0.00 7.58 0.58 0.70
vdb 0.00 0.00 0.00 2.00 0.00 4.00 4.00 0.00 1.00 0.00 1.00 1.00 0.20
vdc 0.00 0.00 0.00 4.00 0.00 0.00 0.00 0.00 0.25 0.00 0.25 0.00 0.00
vdd 0.00 72.00 105.00 158.00 5400.00 61480.00 508.59 11.60 44.29 2.02 72.37 1.32 34.60
vde 1.00 64.00 195.00 215.00 8708.00 93776.00 499.92 36.01 90.11 20.66 153.10 1.71 70.20
vdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
lsblk confirms that the two busy disks do belong to Cassandra:
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 253:0 0 100G 0 disk
└─vda1 253:1 0 100G 0 part /
vdb 253:16 0 200G 0 disk /data0
vdc 253:32 0 200G 0 disk /var/lib/container
vdd 253:48 0 2T 0 disk /var/lib/container/kubelet/pods/83e41036-6df9-4a81-8ac8-f9a606d7f029/volumes/kubernetes.io~csi/disk-da080df0-fda9-
vde 253:64 0 2T 0 disk /var/lib/container/kubelet/pods/8f49e222-9197-443a-9054-aa1996be9493/volumes/kubernetes.io~csi/disk-43627433-f76a-
vdf 253:80 0 20G 0 disk /var/lib/container/kubelet/pods/96f86124-0f31-477f-a764-b3eaebb0bc6d/volumes/kubernetes.io~csi/disk-ca32c719-3cfa-
So the guess is that these errors are caused by Cassandra not keeping up; slow flushes also make the ingesters accumulate memory, which in turn is likely what leads to the OOMs.
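If the store cannot keep up, the ingester flush path can also be tuned so chunks do not pile up in memory as quickly. A sketch with illustrative values (verify the exact option names against your Loki version):
ingester:
  # number of flush workers writing chunks to the store in parallel
  concurrent_flushes: 32
  # how often to check for chunks that need flushing
  flush_check_period: 15s
  # give slow Cassandra writes more time before a flush attempt is abandoned
  flush_op_timeout: 1m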
Error 17:
# kubectl logs -f -n grafana querier-loki-0 |egrep -v 'level=debug|level=info'
level=error ts=2020-11-24T17:20:54.847423531Z caller=worker_frontend_manager.go:96 msg="error contacting frontend" err="rpc error: code = Unavailable desc = connection closed"
level=error ts=2020-11-24T17:20:55.001467122Z caller=worker_frontend_manager.go:96 msg="error contacting frontend" err="rpc error: code = Unavailable desc = connection closed"
level=error ts=2020-11-24T17:20:55.109948605Z caller=worker_frontend_manager.go:96 msg="error contacting frontend" err="rpc error: code = Unavailable desc = connection closed"
Solution:
There is no Service exposing the gRPC port at all:
# kubectl get svc -n grafana |grep frontend-loki
frontend-loki ClusterIP 172.21.11.213 <none> 3100/TCP 5h3m
frontend-loki-headless ClusterIP None <none> 3100/TCP 5h3m
The plan was to define it through the chart's extraPorts and service fields, but the service field does not support multiple ports, so the Service has to be created by hand:
# cat > frontend-loki-grpc.yaml <<EOF
apiVersion: v1
kind: Service
metadata:
annotations:
meta.helm.sh/release-name: frontend
meta.helm.sh/release-namespace: grafana
labels:
app: loki
app.kubernetes.io/managed-by: Helm
chart: loki-2.0.2
heritage: Helm
release: frontend
name: frontend-loki-grpc
namespace: grafana
spec:
ports:
- name: grpc
port: 9095
protocol: TCP
targetPort: 9095
selector:
app: loki
release: frontend
sessionAffinity: None
type: ClusterIP
EOF
# kubectl apply -f frontend-loki-grpc.yaml
service/frontend-loki-grpc created
Then point the queriers at it:
frontend_worker:
frontend_address: "frontend-loki-grpc:9095"
Error 18:
# kubectl logs -f -n grafana distributor-loki-0 |egrep -v 'level=debug|level=info'
level=warn ts=2020-11-24T18:00:27.948766075Z caller=logging.go:71 traceID=570d6eabac64c40a msg="POST /loki/api/v1/push (500) 2.347177ms Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 168137; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-24T18:00:28.182532696Z caller=logging.go:71 traceID=40dda83791c46d22 msg="POST /loki/api/v1/push (500) 2.930596ms Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 180980; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-24T18:00:28.512289072Z caller=logging.go:71 traceID=4de6a07f956e0a26 msg="POST /loki/api/v1/push (500) 2.218879ms Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 155860; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-24T18:00:32.355800743Z caller=logging.go:71 traceID=23893b2884f630d6 msg="POST /loki/api/v1/push (500) 899.118μs Response: \"empty ring\\n\" ws: false; Content-Length: 42077; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
# kubectl logs -f -n grafana promtail-zszhj
level=warn ts=2020-11-24T17:59:40.125688742Z caller=client.go:288 component=client host=distributor-loki:3100 msg="error sending batch, will retry" status=500 error="server returned HTTP status 500 Internal Server Error (500): at least 1 live replicas required, could only find 0"
level=warn ts=2020-11-24T17:59:49.18473538Z caller=client.go:288 component=client host=distributor-loki:3100 msg="error sending batch, will retry" status=500 error="server returned HTTP status 500 Internal Server Error (500): at least 1 live replicas required, could only find 0"
level=warn ts=2020-11-24T18:00:20.779127702Z caller=client.go:288 component=client host=distributor-loki:3100 msg="error sending batch, will retry" status=500 error="server returned HTTP status 500 Internal Server Error (500): at least 1 live replicas required, could only find 0"
# kubectl logs -f -n grafana promtail-zszhj
level=warn ts=2020-11-24T17:36:03.040901602Z caller=client.go:288 component=client host=distributor-loki:3100 msg="error sending batch, will retry" status=500 error="server returned HTTP status 500 Internal Server Error (500): rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.41.186.88:9095: connect: connection refused\""
level=warn ts=2020-11-24T17:36:04.316609825Z caller=client.go:288 component=client host=distributor-loki:3100 msg="error sending batch, will retry" status=500 error="server returned HTTP status 500 Internal Server Error (500): rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.41.186.88:9095: connect: connection refused\""
Solution:
- promtail cannot reach the distributor, e.g. because the distributor is down
- the distributor cannot reach the ingesters; these errors show up whenever the ingesters are OOMKilled
- it is logged when a promtail push fails, but promtail retries with backoff, so it is not something to worry about too much (see the backoff_config sketch after the source excerpt below)
Relevant source: loki/pkg/promtail/client/client.go
for backoff.Ongoing() {
start := time.Now()
status, err = c.send(ctx, tenantID, buf)
requestDuration.WithLabelValues(strconv.Itoa(status), c.cfg.URL.Host).Observe(time.Since(start).Seconds())
...
// Only retry 429s, 500s and connection-level errors.
if status > 0 && status != 429 && status/100 != 5 {
break
}
// err here comes from c.send() above
level.Warn(c.logger).Log("msg", "error sending batch, will retry", "status", status, "error", err)
batchRetries.WithLabelValues(c.cfg.URL.Host).Inc()
backoff.Wait()
}
func (c *client) send(ctx context.Context, tenantID string, buf []byte) (int, error) {
ctx, cancel := context.WithTimeout(ctx, c.cfg.Timeout)
defer cancel()
req, err := http.NewRequest("POST", c.cfg.URL.String(), bytes.NewReader(buf))
if err != nil {
return -1, err
}
req = req.WithContext(ctx)
req.Header.Set("Content-Type", contentType)
req.Header.Set("User-Agent", UserAgent)
...
if resp.StatusCode/100 != 2 {
scanner := bufio.NewScanner(io.LimitReader(resp.Body, maxErrMsgLen))
line := ""
if scanner.Scan() {
line = scanner.Text()
}
err = fmt.Errorf("server returned HTTP status %s (%d): %s", resp.Status, resp.StatusCode, line)
}
return resp.StatusCode, err
}
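The retry cadence seen above can be tuned on the promtail side through the client backoff settings; a minimal sketch with illustrative values:
clients:
  - url: http://distributor-loki:3100/loki/api/v1/push
    backoff_config:
      # initial backoff after a failed push, doubled on every retry
      min_period: 500ms
      # upper bound for the backoff
      max_period: 5m
      # give up on a batch after this many retries
      max_retries: 10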
Error 20:
Adding frontend-loki as a data source in Grafana returns: Loki: Internal Server Error. 500. unsupported protocol scheme "querier-loki"
Solution:
frontend:
# URL of the downstream querier; it must include http:// and the namespace
# note: the official docs call this the Prometheus address, which is wrong
downstream_url: "http://querier-loki.grafana:3100"
Error 21:
A Grafana query returns: unconfigured table index_18585
Solution:
- The query range exceeds the data retention period; in my setup querying more than 3 days back triggers it
table_manager:
retention_deletes_enabled: true
retention_period: 72h
- Caused by a replication_factor that the configured consistency level cannot be satisfied with
storage_config:
cassandra:
#consistency: "QUORUM"
consistency: "ONE"
# replication_factor is not compatible with the NetworkTopologyStrategy
replication_factor: 1
Error 22:
A Grafana query returns: Cannot achieve consistency level QUORUM
Solution:
Although my Cassandra cluster has 3 nodes, the keyspace was created with replication_factor 1, so when a Cassandra node was killed by OOM the quorum could no longer be reached. Recreate the keyspace with a higher RF:
# CREATE KEYSPACE loki WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
The formula:
quorum = (sum_of_replication_factors / 2) + 1
where:
sum_of_replication_factors = datacenter1_RF + datacenter2_RF + ... + datacentern_RF
For example, with a single datacenter and RF = 3, quorum = (3 / 2) + 1 = 2, so the cluster can tolerate one replica being down.
Error 23:
A Grafana query returns: gocql: no response received from cassandra within timeout period
Solution:
Increase the Cassandra timeouts in the Loki config:
storage_config:
cassandra:
timeout: 30s
connect_timeout: 30s
Error 24:
# kubectl logs -f -n grafana ingester-loki-0
level=info ts=2020-11-24T07:28:49.886973328Z caller=events.go:247 module=gocql client=table-manager msg=Session.handleNodeUp ip=10.41.178.155 port=9042
level=error ts=2020-11-24T07:28:49.996021839Z caller=connectionpool.go:523 module=gocql client=table-manager msg="failed to connect" address=10.41.178.155:9042 error="Keyspace 'loki' does not exist"
Solution:
The loki keyspace had not been created in Cassandra beforehand; it can be created manually up front:
# CREATE KEYSPACE loki WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
Checking Cassandra, however, showed that the keyspace had in fact been created automatically, and after Loki restarted once the error no longer appeared, so again Loki's retry logic seems to be at fault.
Error 25:
# kubectl logs -f -n grafana querier-loki-0
level=info ts=2020-11-27T04:37:11.75172174Z caller=events.go:247 module=gocql client=index-write msg=Session.handleNodeUp ip=10.41.238.253 port=9042
level=info ts=2020-11-27T04:37:11.760460559Z caller=events.go:247 module=gocql client=chunks-write msg=Session.handleNodeUp ip=10.41.176.102 port=9042
level=info ts=2020-11-27T04:37:11.760530285Z caller=events.go:271 module=gocql client=chunks-write msg=Session.handleNodeDown ip=10.41.238.255 port=9042
level=info ts=2020-11-27T04:37:11.776492597Z caller=events.go:247 module=gocql client=chunks-read msg=Session.handleNodeUp ip=10.41.176.65 port=9042
level=info ts=2020-11-27T04:37:11.776559951Z caller=events.go:271 module=gocql client=chunks-read msg=Session.handleNodeDown ip=10.41.239.0 port=9042
level=error ts=2020-11-27T04:37:11.783500883Z caller=connectionpool.go:523 module=gocql client=index-read msg="failed to connect" address=10.41.176.65:9042 error="Keyspace 'loki' does not exist"
level=error ts=2020-11-27T04:37:11.847266845Z caller=connectionpool.go:523 module=gocql client=index-write msg="failed to connect" address=10.41.238.253:9042 error="Keyspace 'loki' does not exist"
Solution:
This occurred after changing the Cassandra replication strategy from SimpleStrategy to NetworkTopologyStrategy; updating the configuration and restarting the queriers fixed it.
Error 26:
# kubectl logs -f -n grafana distributor-loki-0
level=error ts=2020-11-26T05:36:58.159259504Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"
level=error ts=2020-11-26T05:37:13.159263304Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"
level=error ts=2020-11-26T05:37:28.159244957Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"
Solution:
- The ingester service is down
- The ring stored in Consul has become extremely slow to load
Reference: https://github.com/hashicorp/consul/issues/3358
Consul keys have a default size limit of 512KB; beyond that the key can no longer be read.
ingester:
lifecycler:
# number of tokens registered on the hash ring (think of them as virtual nodes)
# setting this to 512 made the ring in Consul extremely slow to load and other components kept reporting too many failed ingesters, even though most ingester instances were actually fine
num_tokens: 128
- Regarding the replication factor (see the config sketch after the source excerpts below):
Relevant source: https://github.com/grafana/loki/blob/master/vendor/github.com/cortexproject/cortex/pkg/ring/replication_strategy.go
func (s *DefaultReplicationStrategy) Filter(ingesters []IngesterDesc, op Operation, replicationFactor int, heartbeatTimeout time.Duration, zoneAwarenessEnabled bool) ([]IngesterDesc, int, error) {
// We need a response from a quorum of ingesters, which is n/2 + 1. In the
// case of a node joining/leaving, the actual replica set might be bigger
// than the replication factor, so use the bigger or the two.
if len(ingesters) > replicationFactor {
replicationFactor = len(ingesters)
}
minSuccess := (replicationFactor / 2) + 1
// note the explanation below
// Skip those that have not heartbeated in a while. NB these are still
// included in the calculation of minSuccess, so if too many failed ingesters
// will cause the whole write to fail.
for i := 0; i < len(ingesters); {
if ingesters[i].IsHealthy(op, heartbeatTimeout) {
i++
} else {
ingesters = append(ingesters[:i], ingesters[i+1:]...)
}
}
// This is just a shortcut - if there are not minSuccess available ingesters,
// after filtering out dead ones, don't even bother trying.
if len(ingesters) < minSuccess {
var err error
if zoneAwarenessEnabled {
err = fmt.Errorf("at least %d live replicas required across different availability zones, could only find %d", minSuccess, len(ingesters))
} else {
err = fmt.Errorf("at least %d live replicas required, could only find %d", minSuccess, len(ingesters))
}
return nil, 0, err
}
return ingesters, len(ingesters) - minSuccess, nil
}
Relevant source: https://github.com/grafana/loki/blob/master/vendor/github.com/cortexproject/cortex/pkg/ring/ring.go
// GetAll returns all available ingesters in the ring.
func (r *Ring) GetAll(op Operation) (ReplicationSet, error) {
r.mtx.RLock()
defer r.mtx.RUnlock()
if r.ringDesc == nil || len(r.ringTokens) == 0 {
return ReplicationSet{}, ErrEmptyRing
}
// Calculate the number of required ingesters;
// ensure we always require at least RF-1 when RF=3.
numRequired := len(r.ringDesc.Ingesters)
if numRequired < r.cfg.ReplicationFactor {
numRequired = r.cfg.ReplicationFactor
}
maxUnavailable := r.cfg.ReplicationFactor / 2
numRequired -= maxUnavailable
ingesters := make([]IngesterDesc, 0, len(r.ringDesc.Ingesters))
for _, ingester := range r.ringDesc.Ingesters {
if r.IsHealthy(&ingester, op) {
ingesters = append(ingesters, ingester)
}
}
if len(ingesters) < numRequired {
return ReplicationSet{}, fmt.Errorf("too many failed ingesters")
}
return ReplicationSet{
Ingesters: ingesters,
MaxErrors: len(ingesters) - numRequired,
}, nil
}
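Per the filter logic above, the write quorum is derived from replication_factor, so for a small test setup the factor can be lowered so that a single healthy ingester is enough. A sketch, not a production recommendation (the Consul host name is an assumption for this environment):
ingester:
  lifecycler:
    ring:
      kvstore:
        store: consul
        consul:
          host: consul-server:8500
      # with 1, a single healthy ingester satisfies the write quorum
      replication_factor: 1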
Error 27:
# kubectl logs -f -n grafana ingester-loki-2
level=warn ts=2020-11-26T11:28:06.835845619Z caller=grpc_logging.go:38 method=/logproto.Pusher/Push duration=4.811872ms err="rpc error: code = Code(400) desc = entry with timestamp 2020-11-26 11:28:06.736090173 +0000 UTC ignored, reason: 'entry out of order' for stream: {...},\nentry with timestamp 2020-11-26 11:28:06.736272024 +0000 UTC ignored, reason: 'entry out of order' for stream: {...}, total ignored: 38 out of 38" msg="gRPC\n"
Solution:
The entries arrived out of order; this is just a warning emitted by the ingester.
Error 28:
# kubectl logs -f -n grafana ingester-loki-0 --previous
level=info ts=2020-11-26T09:41:07.828515989Z caller=main.go:128 msg="Starting Loki" version="(version=2.0.0, branch=HEAD, revision=6978ee5d7)"
level=info ts=2020-11-26T09:41:07.828765981Z caller=server.go:225 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=error ts=2020-11-26T09:41:17.867540854Z caller=session.go:286 module=gocql client=index-read msg="dns error" error="lookup cassandra-cassandra-dc1-dc1-nodes on 172.21.0.10:53: no such host"
Solution:
storage_config:
cassandra:
addresses: cassandra-cassandra-dc1-dc1-nodes
disable_initial_host_lookup: true
Error 29:
# kubectl logs -f -n grafana ingester-loki-10
level=error ts=2020-11-27T03:05:52.561015003Z caller=main.go:85 msg="validating config" err="invalid schema config: the table period must be a multiple of 24h (1h for schema v1)"
Solution:
The table period must be a multiple of 24h:
schema_config:
configs:
- from: 2020-10-24
index:
prefix: index_
period: 24h
chunks:
prefix: chunks_
period: 24h
Error 30:
# kubectl logs -f -n grafana querier-loki-0
level=error ts=2020-11-27T04:05:51.648649625Z caller=worker_frontend_manager.go:96 msg="error contacting frontend" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.21.0.34:9095: connect: connection refused\""
level=error ts=2020-11-27T04:05:51.648673039Z caller=worker_frontend_manager.go:96 msg="error contacting frontend" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.21.0.34:9095: connect: connection refused\""
Solution:
- The frontend service is down; bring it back up
- The frontend configuration is wrong and needs to be re-applied
Error 31:
# kubectl logs -f -n grafana querier-loki-0
level=error ts=2020-11-27T04:18:55.308433009Z caller=connectionpool.go:523 module=gocql client=index-write msg="failed to connect" address=10.41.176.218:9042 error="gocql: no response received from cassandra within timeout period"
level=error ts=2020-11-27T04:18:55.308493581Z caller=connectionpool.go:523 module=gocql client=chunks-write msg="failed to connect" address=10.41.176.218:9042 error="gocql: no response received from cassandra within timeout period"
level=error ts=2020-11-27T04:18:55.308461356Z caller=connectionpool.go:523 module=gocql client=index-read msg="failed to connect" address=10.41.176.218:9042 error="gocql: no response to connection startup within timeout"
level=error ts=2020-11-27T04:18:55.308497652Z caller=connectionpool.go:523 module=gocql client=chunks-read msg="failed to connect" address=10.41.176.218:9042 error="gocql: no response received from cassandra within timeout period"
Solution:
The querier apparently cannot reach that Cassandra node:
# kubectl get pod -n grafana -o wide |grep 10.41.176.218
cassandra-cassandra-dc1-dc1-rack1-6 1/2 Running 0 15h 10.41.176.218 cn-hangzhou.10.41.128.145 <none> <none>
# kubectl logs -f -n grafana cassandra-cassandra-dc1-dc1-rack1-6 -c cassandra
WARN [MessagingService-Incoming-/10.41.176.122] IncomingTcpConnection.java:103 UnknownColumnFamilyException reading from socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for cfId 26534be0-3042-11eb-b160-ef05878ef351. If a table was just created, this is likely due to the schema not being fully propagated. Please wait for schema agreement on table creation.
at org.apache.cassandra.config.CFMetaData$Serializer.deserialize(CFMetaData.java:1578) ~[apache-cassandra-3.11.9.jar:3.11.9]
at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize30(PartitionUpdate.java:900) ~[apache-cassandra-3.11.9.jar:3.11.9]
at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize(PartitionUpdate.java:875) ~[apache-cassandra-3.11.9.jar:3.11.9]
at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:415) ~[apache-cassandra-3.11.9.jar:3.11.9]
at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:434) ~[apache-cassandra-3.11.9.jar:3.11.9]
at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:371) ~[apache-cassandra-3.11.9.jar:3.11.9]
at org.apache.cassandra.net.MessageIn.read(MessageIn.java:123) ~[apache-cassandra-3.11.9.jar:3.11.9]
at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:195) ~[apache-cassandra-3.11.9.jar:3.11.9]
at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:183) ~[apache-cassandra-3.11.9.jar:3.11.9]
at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:94) ~[apache-cassandra-3.11.9.jar:3.11.9]
Explanation:
When a new node is added to an existing cluster it has no schema yet and must sync the tables from the cluster.
If the node joins before its table schema has been created, requests routed to it cannot be handled until the schema has fully propagated.
Error 33:
# kubectl logs -f -n grafana --tail=10 distributor-loki-0
level=warn ts=2020-11-27T05:51:48.000652876Z caller=logging.go:71 traceID=5b92b35f3d057623 msg="POST /loki/api/v1/push (500) 460.775μs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 78746; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-27T05:51:48.000995397Z caller=logging.go:71 traceID=5095e3e5e161a426 msg="POST /loki/api/v1/push (500) 222.613μs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 6991; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
...
level=error ts=2020-11-27T05:51:48.31756207Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"
level=warn ts=2020-11-27T05:51:48.366711308Z caller=logging.go:71 traceID=79703c1097fb3890 msg="POST /loki/api/v1/push (500) 441.255μs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 11611; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-27T05:52:18.000340865Z caller=logging.go:71 traceID=69210fec26d8a92b msg="POST /loki/api/v1/push (500) 157.989μs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 7214; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-27T05:52:18.00061787Z caller=logging.go:71 traceID=1a19280a13626e4d msg="POST /loki/api/v1/push (500) 284.977μs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 2767; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
...
level=error ts=2020-11-27T05:52:18.317576958Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"
level=warn ts=2020-11-27T05:52:18.357480921Z caller=logging.go:71 traceID=2e144bed2da000e6 msg="POST /loki/api/v1/push (500) 438.836μs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 12040; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-27T05:52:48.000666573Z caller=logging.go:71 traceID=7b072e8c150d335f msg="POST /loki/api/v1/push (500) 292.152μs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 7142; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
...
level=error ts=2020-11-27T05:52:48.317561107Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"
level=warn ts=2020-11-27T05:53:18.000228447Z caller=logging.go:71 traceID=4f484503331187aa msg="POST /loki/api/v1/push (500) 1.347378ms Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 89191; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-27T05:53:18.00071364Z caller=logging.go:71 traceID=73a47d220a07f209 msg="POST /loki/api/v1/push (500) 584.582μs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 8336; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
...
level=error ts=2020-11-27T05:53:18.317557575Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"
level=warn ts=2020-11-27T05:53:48.000718725Z caller=logging.go:71 traceID=6b8433e4ce976efe msg="POST /loki/api/v1/push (500) 674.648μs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 81500; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-27T05:53:48.000751823Z caller=logging.go:71 traceID=3788174d48a7748c msg="POST /loki/api/v1/push (500) 278.135μs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 6658; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
...
level=error ts=2020-11-27T05:53:48.317566199Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"
level=warn ts=2020-11-27T05:54:18.000965033Z caller=logging.go:71 traceID=273b54ffd8b6f851 msg="POST /loki/api/v1/push (500) 170.859μs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 7161; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-27T05:54:18.001483342Z caller=logging.go:71 traceID=66c98097e558e47d msg="POST /loki/api/v1/push (500) 373.922μs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 4954; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
Solution:
There is a pattern here: the too many failed ingesters error fires roughly every 30s, so it is worth looking for 30s-valued parameters (heartbeat or health-check intervals) in the configuration.
Error 34:
# kubectl logs -f -n grafana --tail=10 querier-loki-1
level=error ts=2020-11-27T06:59:46.02122327Z caller=worker_frontend_manager.go:102 msg="error processing requests" err="rpc error: code = Unavailable desc = transport is closing"
level=error ts=2020-11-27T06:59:46.021237567Z caller=worker_frontend_manager.go:102 msg="error processing requests" err="rpc error: code = Unavailable desc = transport is closing"
level=error ts=2020-11-27T06:59:46.021184495Z caller=worker_frontend_manager.go:102 msg="error processing requests" err="rpc error: code = Unavailable desc = transport is closing"
Solution:
The error appears when the frontend is redeployed, deleted or otherwise unhealthy; it simply means the server side closed the connection.
Error 35:
A Grafana query returns:
<html> <head><title>504 Gateway Time-out</title></head> <body bgcolor="white"> <center><h1>504 Gateway Time-out</h1></center> <hr><center>nginx</center> </body> </html> <!-- a padding to disable MSIE and Chrome friendly error page --> <!-- a padding to disable MSIE and Chrome friendly error page --> <!-- a padding to disable MSIE and Chrome friendly error page --> <!-- a padding to disable MSIE and Chrome friendly error page --> <!-- a padding to disable MSIE and Chrome friendly error page --> <!-- a padding to disable MSIE and Chrome friendly error page -->
Solution:
# kubectl logs -f -n grafana --tail=10 frontend-loki-0
level=error ts=2020-11-27T08:12:31.178604776Z caller=retry.go:71 msg="error processing request" try=0 err=EOF
level=error ts=2020-11-27T08:12:31.685861905Z caller=retry.go:71 msg="error processing request" try=0 err=EOF
level=error ts=2020-11-27T08:12:31.782152574Z caller=retry.go:71 msg="error processing request" try=0 err=EOF
level=error ts=2020-11-27T08:13:00.340916358Z caller=retry.go:71 msg="error processing request" try=1 err="context canceled"
level=info ts=2020-11-27T08:13:00.340962799Z caller=frontend.go:220 org_id=fake traceID=349eb374b92cff3 msg="slow query detected" method=GET host=frontend-loki.grafana:3100 path=/loki/api/v1/query_range time_taken=59.995024399s param_query="{app=\"flog\"} |= \"/solutions/turn-key/facilitate\"" param_start=1606443120000000000 param_end=1606464721000000000 param_step=10 param_direction=BACKWARD param_limit=1000
level=info ts=2020-11-27T08:13:00.341110652Z caller=metrics.go:81 org_id=fake traceID=349eb374b92cff3 latency=fast query="{app=\"flog\"} |= \"/solutions/turn-key/facilitate\"" query_type=filter range_type=range length=6h0m1s step=10s duration=0s status=499 throughput_mb=0 total_bytes_mb=0
The query range was too large and timed out; the roughly 60s cut-off seen above matches a typical proxy timeout.
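Besides raising the proxy and Grafana timeouts, the query-frontend can split long range queries into smaller sub-queries so each piece finishes faster; a sketch under the assumption that this option is available in the deployed Loki version:
query_range:
  # split a long range query into sub-queries of this interval and run them in parallel
  split_queries_by_interval: 30m
  align_queries_with_step: true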
Error 36:
# kubectl logs -f -n grafana --tail=10 querier-loki-0
level=info ts=2020-11-28T14:24:04.047366237Z caller=metrics.go:81 org_id=fake traceID=51ab87b8886e7240 latency=fast query="{app=\"flog\"}" query_type=limited range_type=range length=1h0m1s step=2s duration=173.147382ms status=200 throughput_mb=8.287327 total_bytes_mb=1.434929
level=info ts=2020-11-28T14:24:05.03985635Z caller=metrics.go:81 org_id=fake traceID=46f7eb80b08eee92 latency=fast query="{app=\"flog\"}" query_type=limited range_type=range length=1h0m1s step=2s duration=180.095448ms status=200 throughput_mb=7.967602 total_bytes_mb=1.434929
level=error ts=2020-11-28T14:24:44.799142534Z caller=http.go:256 org_id=fake traceID=2ca9bf8b1ba97c08 msg="Error from client" err="websocket: close 1006 (abnormal closure): unexpected EOF"
level=error ts=2020-11-28T14:24:44.799151755Z caller=http.go:279 org_id=fake traceID=2ca9bf8b1ba97c08 msg="Error writing to websocket" err="writev tcp 10.41.182.22:3100->10.41.191.121:49894: writev: connection reset by peer"
level=error ts=2020-11-28T14:24:44.799310451Z caller=http.go:281 org_id=fake traceID=2ca9bf8b1ba97c08 msg="Error writing close message to websocket" err="writev tcp 10.41.182.22:3100->10.41.191.121:49894: writev: connection reset by peer
Solution:
# kubectl get pod -n grafana -o wide |grep 10.41.182.22
querier-loki-0 1/1 Running 0 11h 10.41.182.22 cn-hangzhou.10.41.131.196 <none> <none>
# kubectl get pod -n grafana -o wide |grep 10.41.191.121
frontend-loki-0 1/1 Running 0 11h 10.41.191.121 cn-hangzhou.10.41.131.200 <none> <none>
Again there is a pattern: the websocket errors appear roughly every 30s, which is probably related to a timeout somewhere in the chain.
Error 37:
Clicking around in Grafana returns: Cannot achieve consistency level ONE
Solution:
# kubectl logs -f -n grafana ingester-loki-1 --tail=1
level=error ts=2020-11-28T07:15:19.544408457Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="Cannot achieve consistency level ONE"
level=error ts=2020-11-28T07:15:19.602324008Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="Cannot achieve consistency level ONE"
level=error ts=2020-11-28T07:15:19.602912508Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="Cannot achieve consistency level ONE"
- The Cassandra cluster is down
- The Cassandra keyspace or replication strategy is misconfigured
Error 38:
# kubectl logs -f -n grafana --tail=10 cassandra-cassandra-dc1-dc1-rack1-2 -c cassandra
INFO [CompactionExecutor:126] NoSpamLogger.java:91 Maximum memory usage reached (267386880), cannot allocate chunk of 1048576
INFO [CompactionExecutor:128] NoSpamLogger.java:91 Maximum memory usage reached (267386880), cannot allocate chunk of 1048576
INFO [CompactionExecutor:128] NoSpamLogger.java:91 Maximum memory usage reached (267386880), cannot allocate chunk of 1048576
INFO [CompactionExecutor:128] NoSpamLogger.java:91 Maximum memory usage reached (267386880), cannot allocate chunk of 1048576
INFO [CompactionExecutor:129] NoSpamLogger.java:91 Maximum memory usage reached (267386880), cannot allocate chunk of 1048576
Solution:
# vi cassandra.yaml
file_cache_size_in_mb: 2048
Error 40:
# kubectl exec cassandra-cassandra-dc1-dc1-rack1-0 -c cassandra -n grafana -- cqlsh -e "desc KEYSPACE loki;" cassandra-cassandra-dc1-dc1-nodes -ucassandra -pcassandra
CREATE KEYSPACE loki WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '1'} AND durable_writes = true;
Solution:
durable_writes tells Cassandra whether updates to this keyspace should go through the commitlog; it is optional and defaults to true.
Wait a moment and Cassandra will propagate the keyspace to all nodes on its own.
Error 41:
# kubectl logs -f -n grafana ingester-loki-1 --tail=1
level=error ts=2020-11-28T07:46:07.802171225Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="unconfigured table chunks_18594"
level=error ts=2020-11-28T07:46:07.805926054Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="unconfigured table chunks_18594"
level=error ts=2020-11-28T07:46:07.813179527Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="unconfigured table chunks_18594"
level=error ts=2020-11-28T07:46:07.814556323Z caller=flush.go:199 org_id=fake msg="failed to flush user" err="unconfigured table chunks_18594"
Solution:
Wait a moment: Loki creates the table automatically and Cassandra then syncs it to all nodes.
# kubectl exec cassandra-cassandra-dc1-dc1-rack1-0 -c cassandra -n grafana -- cqlsh -e "desc KEYSPACE loki;" cassandra-cassandra-dc1-dc1-nodes -ucassandra -pcassandra |grep chunks_18594
# kubectl exec cassandra-cassandra-dc1-dc1-rack1-0 -c cassandra -n grafana -- cqlsh -e "desc KEYSPACE loki;" cassandra-cassandra-dc1-dc1-nodes -ucassandra -pcassandra |grep chunks_18594
CREATE TABLE loki.chunks_18594 (
Error 42:
# kubectl logs -f -n grafana promtail-n6hvs
level=error ts=2020-07-06T03:58:02.217480067Z caller=client.go:247 component=client host=192.179.11.1:3100 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry for stream '{app=\"app_error\", filename=\"/error.log\", host=\"192.179.11.12\"}' has timestamp too new: 2020-07-06 03:58:01.175699907 +0000 UTC"
Solution:
The clocks of the two machines drifted too far apart: the host running promtail was not synced with an NTP server, hence the error. Syncing the clocks fixes it.
Error 43:
# kubectl logs -f -n grafana --tail=1 distributor-loki-0
level=warn ts=2020-11-28T11:22:36.356296195Z caller=logging.go:71 traceID=368cb9d5bab3db0 msg="POST /loki/api/v1/push (500) 683.54μs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 11580; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-28T11:22:36.448942924Z caller=logging.go:71 traceID=59e19c2fd604299 msg="POST /loki/api/v1/push (500) 391.358μs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 10026; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
level=warn ts=2020-11-28T11:22:36.470361015Z caller=logging.go:71 traceID=152d6e5c824eb473 msg="POST /loki/api/v1/push (500) 873.049μs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Content-Length: 81116; Content-Type: application/x-protobuf; User-Agent: promtail/2.0.0; "
Solution:
Same symptom as Errors 18 and 33: the distributor cannot find any healthy ingester in the ring.
Error 44:
# helm upgrade --install -f loki-config.yaml distributor --set config.target=distributor --set replicas=10 loki-2.0.2.tgz -n grafana
Error: UPGRADE FAILED: "distributor" has no deployed releases
Solution:
It looks like a previous release was not fully uninstalled, yet it does not show up in the normal release list:
# helm list -n grafana
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
consul grafana 1 2020-11-24 18:11:01.057473546 +0800 CST deployed consul-0.26.0 1.8.5
etcd grafana 1 2020-11-27 10:37:34.954108464 +0800 CST deployed etcd-5.2.1 3.4.14
ingester grafana 1 2020-11-30 21:10:33.961141159 +0800 CST deployed loki-2.0.2 v2.0.0
promtail grafana 1 2020-11-28 21:24:12.542476902 +0800 CST deployed promtail-2.0.1 v2.0.0
redis grafana 1 2020-11-30 21:00:49.567268068 +0800 CST deployed redis-12.1.1 6.0.9
Plain install does not work either:
# helm install -f loki-config.yaml distributor --set config.target=distributor --set replicas=10 loki-2.0.2.tgz -n grafana
Error: cannot re-use a name that is still in use
Listing all releases reveals one stuck in the uninstalling state; remove it with a zero timeout:
# helm list -n grafana -a
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
consul grafana 1 2020-11-24 18:11:01.057473546 +0800 CST deployed consul-0.26.0 1.8.5
distributor grafana 1 2020-11-30 18:58:58.082367639 +0800 CST uninstalling loki-2.0.2 v2.0.0
etcd grafana 1 2020-11-27 10:37:34.954108464 +0800 CST deployed etcd-5.2.1 3.4.14
ingester grafana 1 2020-11-30 21:10:33.961141159 +0800 CST deployed loki-2.0.2 v2.0.0
promtail grafana 1 2020-11-28 21:24:12.542476902 +0800 CST deployed promtail-2.0.1 v2.0.0
redis grafana 1 2020-11-30 21:00:49.567268068 +0800 CST deployed redis-12.1.1 6.0.9
# helm uninstall -n grafana distributor --timeout 0s
Error 45:
# kubectl logs -f -n grafana ingester-loki-4 --tail=100
level=error ts=2020-11-30T10:54:24.752039646Z caller=redis_cache.go:57 msg="failed to put to redis" name=store.index-cache-write.redis err="EXECABORT Transaction discarded because of previous errors."
level=error ts=2020-11-30T10:54:24.752183644Z caller=redis_cache.go:37 msg="failed to get from redis" name=chunksredis err="MOVED 3270 10.41.178.48:6379"
Solution:
Loki uses Redis transactions here: when several commands are queued and any one of them has an error, EXEC aborts the whole transaction, so even the valid commands are not executed. The MOVED replies also suggest the client is not following Redis Cluster redirects.
Switching from Redis Cluster to a Redis master/slave setup solved it.
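After switching to a single master with replicas, Loki's caches just need to point at the master endpoint; a minimal sketch for the chunk cache, assuming the service name used in this environment:
chunk_store_config:
  chunk_cache_config:
    redis:
      endpoint: redis-master:6379
      timeout: 10s
      expiration: 1h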
Error 46:
Memory on the Redis master and its 2 slaves keeps climbing until they eventually die.
Surprisingly it is not an OOM kill: from the events the exec health checks could not run, which suggests the Redis container simply stopped responding:
# kubectl describe pod -n grafana redis-master-0
State: Running
Started: Tue, 01 Dec 2020 14:14:24 +0800
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Tue, 01 Dec 2020 13:38:26 +0800
Finished: Tue, 01 Dec 2020 14:14:23 +0800
Ready: True
Restart Count: 1
Liveness: exec [sh -c /health/ping_liveness_local.sh 5] delay=5s timeout=6s period=5s #success=1 #failure=5
Readiness: exec [sh -c /health/ping_readiness_local.sh 1] delay=5s timeout=2s period=5s #success=1 #failure=5
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 53s (x2 over 53s) kubelet, cn-hangzhou.10.41.131.206 Readiness probe failed: OCI runtime exec failed: exec failed: cannot exec a container that has stopped: unknown
Warning Unhealthy 50s (x2 over 52s) kubelet, cn-hangzhou.10.41.131.206 Liveness probe failed: OCI runtime exec failed: exec failed: cannot exec a container that has stopped: unknown
Normal Started 49s (x2 over 36m) kubelet, cn-hangzhou.10.41.131.206 Started container redis
Solution:
Reference: https://blog.csdn.net/chenleixing/article/details/50530419
Possible causes:
- A bug in redis-cluster
- A problem with the client's hash(key), causing uneven key distribution
- Individual huge key-values, e.g. a set containing millions of entries
- Master/slave replication problems
- Other causes (the article mentions a monitor process as one)
Scanning for big keys rules that out as well:
# redis-cli -c -h redis-master -a kong62123 --bigkeys
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
# Scanning the entire keyspace to find biggest keys as well as
# average sizes per key type. You can use -i 0.1 to sleep 0.1 sec
# per 100 SCAN commands (not usually needed).
[00.00%] Biggest string found so far '"fake/b00fe6372e647097:1761d0599b8:1761d05d357:b8134b7f"' with 266984 bytes
[00.00%] Biggest string found so far '"fake/72c6b4617011c665:1761cfd2419:1761cfd2850:521c8196"' with 509820 bytes
[00.02%] Biggest string found so far '"fake/cfdd2f587dae9791:1761cfe0194:1761cfe106e:2c7958bb"' with 510839 bytes
[00.04%] Biggest string found so far '"fake/142dc7b3b33f73a5:1761d05ddd4:1761d061d58:69d40fe1"' with 900471 bytes
[00.04%] Biggest string found so far '"fake/232db4b7f88cd92:1761d0a461b:1761d0a76f7:3b53df38"' with 1096777 bytes
[01.63%] Biggest string found so far '"fake/e2f278b0b0f1ec0e:1761d0a817e:1761d0aceb2:e6235dde"' with 1285507 bytes
[05.45%] Biggest string found so far '"fake/2155e9e474220562:1761cfccdd6:1761cfd0975:3f0923bb"' with 1437657 bytes
[05.79%] Biggest string found so far '"fake/75acab73c87ad9e8:1761d06946e:1761d06e0e2:15dddbcc"' with 1485350 bytes
[06.83%] Biggest string found so far '"fake/fab0b46790906085:1761cfe7f43:1761cfea36c:c828e40e"' with 1519460 bytes
[09.30%] Biggest string found so far '"fake/1876208440d8eb69:1761d0a8523:1761d0ac7a1:8ad790b9"' with 1553344 bytes
[45.59%] Biggest string found so far '"fake/5aca699958e02b9e:1761d0a8496:1761d0ac7f7:5a2fa202"' with 1553464 bytes
[57.92%] Biggest string found so far '"fake/f4003cd6c5a6ae4:1761cff0111:1761cff406c:84d48291"' with 1896730 bytes
[83.41%] Biggest string found so far '"fake/31c1ada0c1213aeb:1761cff011c:1761cff4075:806fd546"' with 1896849 bytes
-------- summary -------
Sampled 184141 keys in the keyspace!
Total key length in bytes is 10414202 (avg len 56.56)
Biggest string found '"fake/31c1ada0c1213aeb:1761cff011c:1761cff4075:806fd546"' has 1896849 bytes
0 lists with 0 items (00.00% of keys, avg size 0.00)
0 hashs with 0 fields (00.00% of keys, avg size 0.00)
184141 strings with 20423715612 bytes (100.00% of keys, avg size 110913.46)
0 streams with 0 entries (00.00% of keys, avg size 0.00)
0 sets with 0 members (00.00% of keys, avg size 0.00)
0 zsets with 0 members (00.00% of keys, avg size 0.00)
Disabling redis-exporter made no difference,
so the problem is most likely on the client side. Shortening the cache expiration keeps memory usage down:
redis:
endpoint: redis-master:6379
# Redis Sentinel master name. An empty string for Redis Server or Redis Cluster.
#master_name: master
timeout: 10s
# lower the default expiration of 1h; note: do not make it too small, or the cache never gets any hits
expiration: 10m
Integrating Loki with ClickHouse
In principle the latest code already supports a grpc-store implementation; we implemented the grpc-store interface and can connect to ClickHouse, but flushing data does not work yet.
schema_config:
configs:
- from: 2020-12-02
store: boltdb-shipper
object_store: grpc-store
schema: v11
index:
prefix: index_
period: 24h
chunks:
prefix: chunks_
period: 24h
storage_config:
boltdb_shipper:
active_index_directory: /tmp/loki/boltdb-shipper-active
cache_location: /tmp/loki/boltdb-shipper-cache
cache_ttl: 24h # Can be increased for faster performance over longer query periods, uses more disk space
shared_store: filesystem
grpc_store:
server_address: 172.16.49.14:50001
filesystem:
directory: /tmp/loki/chunks
Reference: https://github.com/grafana/loki/issues/745
The issue explains that ClickHouse cannot be used as blob storage, while Loki and Cortex actually need blob storage for their chunks.
Reference: https://clickhouse.tech/docs/zh/sql-reference/data-types/string/
Strings can be of arbitrary length and can contain an arbitrary set of bytes, including null bytes, so the String type can replace VARCHAR, BLOB, CLOB and similar types from other DBMSs.
In other words, the docs tactfully state that ClickHouse has no dedicated BLOB type.
What is a BLOB in a database?
A BLOB (binary large object) is a container for binary data; it is the column type databases commonly use to store binary files.
A BLOB is a large object, typically an image or an audio file, whose size requires special handling (for upload, download, or storage in a database).
The idea behind BLOB handling is that the file handler (such as the database manager) does not care what the file is, only how to process it.
This way of handling large objects is a double-edged sword: storing very large binary files can drag down database performance. Keeping large multimedia objects in a database is the classic example of an application handling BLOBs.
There is, however, a Node.js implementation that exposes a Loki-compatible API on top of ClickHouse:
Reference: https://github.com/lmangani/cLoki
Its architecture looks like this, i.e. it replaces Loki entirely, so it is not really an option here:
Grafana
|
|
Agent -> Cloki -> Clickhouse