Elasticsearch® is a great system for managing data, be it documents, text, logs, metrics, or more. It can do this at nearly any scale, with high availability and a powerful query language. It does, however, have a reputation for being hard to troubleshoot when things go wrong.
What can go wrong? Usually one of three things: slow performance, yellow or red indexes, and node issues such as low disk space, full queues, tripped circuit breakers, or timeouts. These often appear in combination, and it can be quite difficult to figure out what's wrong and then what to do about it, which often involves arcane JSON commands you hope you get right.
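As a rough illustration of the first step in that kind of triage, here is a minimal Python sketch that reads an already-fetched cluster health document (the JSON returned by `GET _cluster/health`) and turns it into plain-language findings. The sample values and the `triage_health` helper are hypothetical, made up for this example; only the field names come from the health API.

```python
import json

# Hypothetical sample of a GET _cluster/health response.
# The values are invented for illustration only.
SAMPLE_HEALTH = json.loads("""
{
  "cluster_name": "demo",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 3,
  "unassigned_shards": 4,
  "number_of_pending_tasks": 0
}
""")


def triage_health(health):
    """Return short, human-readable findings from a cluster health document."""
    findings = []

    status = health.get("status", "unknown")
    if status == "red":
        findings.append("red: at least one primary shard is unassigned")
    elif status == "yellow":
        findings.append("yellow: all primaries assigned, some replicas are not")
    elif status != "green":
        findings.append(f"unrecognized status: {status}")

    unassigned = health.get("unassigned_shards", 0)
    if unassigned:
        findings.append(f"{unassigned} unassigned shard(s)")

    pending = health.get("number_of_pending_tasks", 0)
    if pending:
        findings.append(f"{pending} pending cluster task(s)")

    if health.get("timed_out"):
        findings.append("health request timed out")

    return findings


for finding in triage_health(SAMPLE_HEALTH):
    print(finding)
```

This only covers the cluster-level summary. Disk space, queues, and breakers live in separate node-level APIs, which is part of why the full picture is hard to assemble by hand.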
The real challenge is to understand the current situation as a whole, including the interplay between settings, memory, indexes, queries, nodes, and more, and then to know what to do about it.
Each of these challenges involves so many moving parts and subsystems that getting a holistic view of both the cluster and the problem index or node can be quite difficult, especially for part-time ELK administrators who just want it to work again.
Even if you've mastered all the definitions, configuration, and management of segments, shards, indexes, nodes, ILM, snapshots, caches, and more, it can still be hard to see the state and history of all those things in real time, or to know which to adjust to make the cluster healthy again.
Monitoring data helps, and Kibana has nice interfaces for it. Sadly, though, ELK logs are not always very useful or accessible, except for seriously broken things like network, disk, or configuration problems.
Our ELKman tool can help troubleshoot Elasticsearch® on several levels. First, you can get cluster overviews that flag obvious issues, and then dive into the problem area, such as a single unhealthy index or a failed node causing several related but temporary problems.
ELKman also audits the current cluster and index configuration and statistics against best practices, identifying potential problem areas to investigate. This can save real time by finding easy-to-break settings, overloaded RAM, and much more. It checks dozens of things that most sysadmins may be unaware of, not know how to check, or forget to look at in the heat of troubleshooting.
Try to say no to JSON & hand-crafted API calls.
When it comes time to fix problems, ELKman includes easy-to-use graphical settings panels, plus common functions in menus to fix things quickly and safely, such as reallocating shards, fixing read-only indexes, merging indexes, and much more. All without JSON or hand-crafted API calls.
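For comparison, this is roughly what one of those fixes looks like when done by hand. The Python sketch below only builds the request for clearing the read-only block on an index (setting `index.blocks.read_only_allow_delete` to `null` through the index settings API); it does not send anything. The `build_unblock_request` helper and the index name are hypothetical, and this is not a description of how ELKman does it internally.

```python
import json


def build_unblock_request(index_name):
    """Build (method, path, body) for clearing the read-only block on one index.

    Hypothetical helper for illustration; it sends nothing over the network.
    """
    method = "PUT"
    path = f"/{index_name}/_settings"
    # Setting the value to JSON null resets the block to its default (off).
    body = json.dumps({"index.blocks.read_only_allow_delete": None})
    return method, path, body


# "logs-2024.01" is a made-up index name.
method, path, body = build_unblock_request("logs-2024.01")
print(method, path)
print(body)
```

Even for a one-line fix like this, you need to know the exact setting name, that the value must be `null` rather than `false`, and which index to target, and you usually also need to free disk space first or the block comes back.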
Elasticsearch® is a great product, and getting better all the time, but remains a little hard to troubleshoot. Our goal is to make this easier for you, so Elasticsearch® can reach its full potential in your environment.