Why do I need three external ZooKeeper machines for SolrCloud?
Do I really need three ZooKeeper servers?
Yes. A fault-tolerant ZooKeeper ensemble needs at least three machines, and Solr index fault tolerance needs at least two. The same hardware can serve both roles, so SolrCloud redundancy requires a minimum of three servers.
You can't use only two machines for ZooKeeper, because the ensemble has quorum only while a majority of its configured servers are up and running. With two servers, a majority is two, so two servers are actually less fault tolerant than one: if *either* of them goes down, the ensemble loses quorum. An odd number of servers is best. With four machines, a majority vote requires three, so your fault tolerance is the same as with three machines: only one can go down.
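The majority arithmetic above can be sketched in a few lines of Python. This is only an illustration of the counting rule, not part of ZooKeeper or Solr; the function names `quorum_size` and `tolerated_failures` are made up for this example.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority (illustrative helper)."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can go down while a majority is still running."""
    return ensemble_size - quorum_size(ensemble_size)


# Print ensemble size, majority needed, and failures tolerated for 1 to 5 servers.
for n in range(1, 6):
    print(f"servers={n} majority={quorum_size(n)} can_lose={tolerated_failures(n)}")
```

Under this rule, one-server and two-server ensembles both tolerate zero failures, three and four servers both tolerate one, and five servers tolerate two, which is why an even-sized ensemble adds no tolerance over the next smaller odd size.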
Failures happen. This is a reality of computer systems, and planning for that failure is absolutely critical. We can put a fair amount of redundancy (multiple power supplies, multiple disks) in a single machine, but it can still fail.
If the machine running your embedded ZooKeeper fails, your entire cloud effectively goes down, even if all the rest of your Solr servers are still running. It might still be possible to make queries, but you won't be able to update the index, and if one of the remaining Solr instances stops, it will not start up correctly without ZooKeeper.
External ZooKeepers for SolrCloud
For a SolrCloud install, it is possible to run ZooKeeper embedded in the same process that runs Solr itself.
An embedded ZooKeeper is a good option for testing, but it causes too many problems in production.
When you stop Solr with ZooKeeper embedded, you also stop ZooKeeper. This is disruptive to ZooKeeper, and if you have only one ZooKeeper process, the entire cloud is effectively down while it is stopped. Note that it *is* possible to run a redundant ZooKeeper ensemble by running at least three Solr processes, each with an embedded ZooKeeper, but every Solr restart is still disruptive to the ensemble.