...

The Grid Agents are Java processes installed and running on each machine that listen (on a configured port) for commands from a Grid Manager application (Web or CLI) and act upon them by interacting with the local Tomcat instances. No encryption is envisioned on the network channels, since all these machines are considered to be installed on a secured network segment (at least in prod). [Maybe we should reconsider this]

markt: I think this needs to be reconsidered.

vlad: Sure. My take was to use unencrypted connections, since that's fast for me to implement and I don't know well how to implement a secured connection :/ . Maybe we could even offer both options. Anyway, why do you consider that we need a secured connection? In my experience, I've never needed them in dev/stage/QA/test, and in prod the environments are isolated. Well... I guess there are other scenarios I haven't been exposed to.

All Tomcat instances (including the Web Grid Managers), as well as the Grid Agents, are installed manually. The primary grid manager and all the agents are started manually too.

Once the Web Grid Manager is started up, machines and instances can be registered in it, so they can become manageable. No centralized (comprehensive) provisioning is envisioned until later versions of Tomcat Grid (see below).

The Grid Agents are processes that can also be managed. In particular, status can be obtained (and shown) from them, and basic operations (start/stop/kill) can be triggered on them. These operations are, however, heavily dependent on the OS and its capabilities (configuration, installed tools, etc.) and on the infrastructure architecture (firewalled machines, network VLANs, etc.).

Collectively, Tomcat instances and Grid Agents are "services" since both can be managed.

markt: Managing the agents strikes me as making this significantly more complex. Operating systems have tools to ensure particular services are running and are restarted if they fail. What is the benefit of pulling this into this tool?

vlad: I agree that agents can run as daemons and be up all the time. However, I've seen many "robust" programs that have memory leaks and that, for one reason or another, stop working normally after some time. Maybe after a couple of days, or weeks, or months. Maybe it would be useful to restart them just before a critical operation such as a deployment. I agree that managing agents is more expensive since it requires the development and maintenance of multiple OS-dependent implementations.

Later versions of the Grid include "collection" management. This allows subsets of services (Tomcat instances and Grid Agents) to be grouped so they can be operated as a whole. Each collection can include plain services or other collections (recursively).

Below is a general overview of the software modules and their responsibilities.

!TomcatGridSoftwareModules.png!

_The modules are:

  • Core module: Shared logic to be used by all other modules.
    • Core data types, such as "Machine", "Instance".
    • Shared core logic. For example, grid configuration file parsing/update.
    • Defines the Grid Agent functions (as interfaces), but does not implement them. These are used by all Manager modules.
    • Common utility classes.
    • Common exceptions.
  • Web Manager module: A JEE Web application that:
    • Includes a simple management web GUI: web pages, navigation logic.
    • Uses the Core module for functions such as:
      • Load Grid Configuration,
      • Interact with grid agents.
  • CLI Manager module: A Java command-line program:
    • Command-line interface: command parsing, text output.
    • Uses the Core module for functions such as:
      • Load Grid Configuration,
      • Interact with grid agents.
  • Any other Manager module: Any future module that needs to connect to Grid Agents to manage Tomcat instances.
  • Grid Agent module: Responds to Manager calls and controls local Tomcat instances.
    • Listens for Manager requests.
    • Implements the Grid Agent interfaces.
    • Includes the high-level interaction with Tomcat instances.
    • Defines and uses the Tomcat Management Primitives (as interfaces), but does not implement them.
    • Receives content (deployables, grid configuration changes) and applies them.
  • Grid Agent Primitives for Linux:
    • Implements the Tomcat Management Primitives for Linux OS.
  • Grid Agent Primitives for Windows:
    • Implements the Tomcat Management Primitives for Windows OS.
  • Grid Agent Primitives for Mac:
    • Implements the Tomcat Management Primitives for Mac OS.
  • Grid Agent Primitives for Other:
    • Implements the Tomcat Management Primitives for Other OS.

Each executable is made up of several modules that are assembled during the build.

  • The Grid Web Manager Executable (a WAR) includes:
    • Core module
    • Web Manager module
  • The Grid CLI Manager Executable (a JAR) includes:
    • Core module
    • CLI Manager module
  • The Grid Agent Executable (a JAR) includes:
    • Core module
    • Grid Agent module
    • Grid Agent Primitives for Linux
    • Grid Agent Primitives for Windows
    • Grid Agent Primitives for Mac
    • Grid Agent Primitives for Other_
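The split described above (Grid Agent functions declared in the Core module, primitives declared in the Grid Agent module, one implementation per OS) could be sketched in Java roughly as follows. This is only an illustration: every type and method name (GridAgent, TomcatPrimitives, LocalGridAgent, PrimitivesSelector) is hypothetical and not taken from any existing code, and the status logic is deliberately simplified.

```java
import java.util.Locale;

// Core module: declares the Grid Agent functions. Managers code against this.
interface GridAgent {
    String status(String service);
    void triggerStart(String service);
    void triggerKill(String service);
}

// Grid Agent module: declares the OS-dependent primitives, implemented elsewhere.
interface TomcatPrimitives {
    boolean processExists(String service);
    void launch(String service);
    void kill(String service);
}

// Grid Agent module: high-level logic, written once against the primitives.
final class LocalGridAgent implements GridAgent {
    private final TomcatPrimitives primitives;

    LocalGridAgent(TomcatPrimitives primitives) {
        this.primitives = primitives;
    }

    @Override
    public String status(String service) {
        // Simplified: only checks for the OS process, not for instance health.
        return primitives.processExists(service) ? "Process exists" : "Stopped";
    }

    @Override
    public void triggerStart(String service) {
        primitives.launch(service);
    }

    @Override
    public void triggerKill(String service) {
        primitives.kill(service);
    }
}

// Chooses the primitives module from an OS name such as System.getProperty("os.name").
final class PrimitivesSelector {
    private PrimitivesSelector() {}

    static String familyOf(String osName) {
        String name = osName.toLowerCase(Locale.ROOT);
        if (name.contains("linux")) return "linux";
        if (name.contains("windows")) return "windows";
        if (name.contains("mac")) return "mac";
        return "other";
    }
}
```

With this shape, the Linux, Windows, Mac and Other modules would each provide one TomcatPrimitives implementation, and the agent executable would pick one at start-up.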

Given all the above, the following phases could serve as a baseline for the roadmap of the Tomcat Grid.

Phase 1 - Core Grid Operation

The first phase includes the most basic features, in order to provide a functioning and useful first version of the Grid.

In particular, this phase includes no automatic provisioning of Tomcat instances or Grid Agents, no configuration GUI (only pre-configured XML config files), no WAR deployments, no command-line interface, no complex grid operations, no secondary managers, and no collections.

markt: This raises another architectural question. Wouldn't this be more scalable if agents were configured with the location of the primary manager and registered themselves? The manager could persist that registration so an agent would have to be explicitly removed if it was taken off-line permanently.

vlad: Definitely. A basic agent install and registration could "summon" all the necessary software from the primary manager, i.e. Tomcat executables, grid configuration, etc. I thought about something like this as an advanced feature (phase 12), but we can rearrange the priorities if needed.

Included features are:

  1. The Web Grid Manager presents a Web interface that shows information about the whole Grid and presents simple buttons to operate the Tomcat instances.
  2. The managing logic must be clearly separated from the Web interface logic, since later on, a Command-Line Grid Manager will be included, and will use the same managing logic.
  3. The available commands for each instance are:
    • status: retrieves the status of a Tomcat instance through the corresponding Grid Agent
    • trigger-start: sends a start request to the Tomcat instance using the corresponding Grid Agent
    • trigger-stop: sends a stop request to the Tomcat instance using the corresponding Grid Agent
    • trigger-kill: sends a kill request to the Tomcat instance using the corresponding Grid Agent
      • markt: A small thing. I think I'd prefer start/stop/kill.
      • vlad: I added these commands in phase 7 and, as I see them, they behave a little differently from the trigger ones, especially when we refer to the CLI manager. The trigger-start issues the signal to the Grid Agent and ends. The start keeps on working (and updating the user) until the Tomcat instance is actually up (or fails to start up). On the Web interface the trigger-start could show up as a simple icon (omitting its name). On the CLI I see the start command as far more useful than the trigger-start.
  4. A simple configuration file lists all the machines and their instances so the Grid knows where each instance resides. [This configuration file is probably in XML format]
  5. Grid Agents are installed on each machine and manage all instances in that machine pertaining to the Grid. Grid Agents receive commands from any manager and act accordingly. To manage the instances the Agents use:
    • Shell calls: start an instance, kill an instance.
    • JMX calls to retrieve instance live information.
    • JMX calls to change instance live values, and to request instance shutdown.
    • OS calls for any OS related need.
  6. It's assumed that a port will be accessible from each Grid Manager to each machine where the Grid Agents are serving. The firewall, if present, must allow active server-type sockets on that port.
    • markt: Another architectural question. Which end opens the connection, does it stay open and which protocol is used? For example, agents connect to Manager via WebSocket.
    • vlad: I always considered the Grid Agent would open a server NIO socket, to avoid using ephemeral ports. The Grid Agent is always listening, and the Managers connect when needed. In terms of protocol, it could be an ad-hoc one, specially developed for this tool, or a well-known standard. I have an ad-hoc one that I can use, but I'm open to suggestions.
  7. Multiple Grids (and Grid Agents) can be running on the same set (or subset) of machines. If that's the case, Tomcat instances and Grid Agents run on different ports for each grid. When multiple grids use the same machines they don't interfere with each other and can be operated simultaneously.
  8. The status command shows the following information for each instance:
    • Machine
    • Service (a unique grid-wide name for each instance)
    • State
  9. The state of an instance can be:
    • Active: the instance OS process exists, the instance is serving requests, and it looks healthy [enough].
    • Zoetic [for lack of a better word]: the instance OS process exists, but the instance is unresponsive and it doesn't respond to requests for state. It's probably not serving any HTTP requests, does not look healthy, it may be starting, it may be shutting down, it may be overwhelmed. Who knows.
    • Stopped: the instance OS process does not exist, and therefore the instance is not operating at all.
    • Not Available: This is a pseudo state that the manager applications (web and cli) show when a Grid Agent does not respond to requests for status in a timely manner.
    • If possible, it would be great to discern different sub-cases of the Zoetic state, to help the user determine what's going on and tackle the case accordingly:
      • Starting: The Tomcat instance process exists, and the instance is starting. It's not yet serving HTTP requests.
      • Stopping: The Tomcat instance process exists, and the instance is stopping. It's no longer serving HTTP requests.
      • Unresponsive: The Tomcat instance process exists, but the instance health isn't good, it's not responding to HTTP requests, or it's overwhelmed. It's not even responding to requests for status. This state is different from "Not Available" since in this case the Grid Agent IS active and responding, but the Tomcat instance itself is unresponsive.
      • On second thoughts, these extra states can actually be discerned today with the current version of Tomcat, since the Grid Agents know all the trigger commands each local instance has received and can deduce (or make up) the sub-case. If the Grid Agent is restarted, some kind of persistence of its state might be needed to "remember" what was going on before the Grid Agent was shut down, so as to make an educated guess.
  10. Grid Agents communicate over unsecured TCP sockets, and assume communication security is enforced by the network architecture (segregated segments/VLANs).
  11. The "trigger"-type commands just deliver the corresponding signal to the instance's Grid Agent and return right away, without waiting for the full operation to complete. It's kind of "fire and forget". The web user can keep on refreshing the web interface to follow the status of the Tomcat instances.
  12. Simple user name/password authentication is implemented to secure the Web interface. [Maybe we'll need to provide more options]
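The state rules listed above can be condensed into a small decision function. The following Java sketch is an assumption-laden illustration, not part of the design: the names InstanceState, ZoeticDetail and StateClassifier are hypothetical, and the sub-case deduction simply uses the last trigger command the Grid Agent remembers (a real implementation would probably also consider elapsed time).

```java
// The states described above. NOT_AVAILABLE is the pseudo state shown by the managers.
enum InstanceState { ACTIVE, ZOETIC, STOPPED, NOT_AVAILABLE }

// Possible sub-cases of ZOETIC, guessed from the last trigger command received.
enum ZoeticDetail { STARTING, STOPPING, UNRESPONSIVE }

final class StateClassifier {
    private StateClassifier() {}

    // agentResponded: the Grid Agent answered the status request in time.
    // processExists: the instance OS process exists on the machine.
    // instanceResponded: the instance answered the agent's request for state.
    static InstanceState classify(boolean agentResponded, boolean processExists,
                                  boolean instanceResponded) {
        if (!agentResponded) return InstanceState.NOT_AVAILABLE;
        if (!processExists) return InstanceState.STOPPED;
        return instanceResponded ? InstanceState.ACTIVE : InstanceState.ZOETIC;
    }

    // lastTrigger: "trigger-start", "trigger-stop", "trigger-kill", or null if unknown.
    static ZoeticDetail refine(String lastTrigger) {
        if ("trigger-start".equals(lastTrigger)) return ZoeticDetail.STARTING;
        if ("trigger-stop".equals(lastTrigger) || "trigger-kill".equals(lastTrigger)) {
            return ZoeticDetail.STOPPING;
        }
        return ZoeticDetail.UNRESPONSIVE;
    }
}
```

Note that "Not Available" is decided first: when the agent itself does not answer, nothing can be said about the instance.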

Phase 2 - Manageable Grid Agents

...

  1. The status command now adds more information for each service (Tomcat instances and Grid Agents):
    • CPU usage (if possible)
    • CPU load (if possible)
    • Heap usage (if possible)
    • Threads (if possible)
    • Started on (if possible)
    • Any other information deemed useful for managing purposes.
  2. [Optional] Machine information (same page, or maybe an extra tab) shows per machine:
    • CPU usage
    • CPU load (1 min, 5 min, 15 min)
    • Memory usage
    • File system space usage for the mount where the "webapps" dir is (can this be different per instance?).

...

  1. In addition to the Web Grid Manager interface, the Command-Line Grid Manager interface is suitable when the web interface cannot be used. Typical cases are when no web port is available on the servers (probably firewalled), when the security policies do not allow remote server operations, etc. This may be the case in some secured/firewalled production environments where only text sessions are acceptable.
  2. The Command-Line Grid Manager is also suitable for automation (e.g. the weekly full/partial site restart) when unattended operations are scheduled, using cron or equivalent utilities.
  3. The Command-Line Grid Manager always leaves a log file per command execution in a directory created for this purpose. Each log file's name includes the time stamp, the command name, and (if possible) the arguments.
  4. The implemented commands are:
    • status
    • trigger-start
    • trigger-stop
    • trigger-kill
  5. The trigger commands are only executed when necessary. If an instance is already running, a trigger-start command will be ignored. Conversely, trigger-stop and trigger-kill commands are ignored when the instance is stopped.
  6. Return codes must be strategically defined to allow automation. Well-defined return codes can provide useful information to the caller program/process (especially for automation), so it can clearly identify the problem and act accordingly.
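The "only executed when necessary" rule and the return codes could fit together as in the Java sketch below. The names Outcome and TriggerPolicy and the numeric codes are purely hypothetical (the page does not define any), and "running" is assumed here to mean that the instance OS process exists.

```java
// Hypothetical outcomes of a trigger command, each mapped to a process exit code.
enum Outcome {
    SENT(0),                 // the signal was delivered to the Grid Agent
    IGNORED(0),              // assumption: an unnecessary command still counts as success
    AGENT_NOT_AVAILABLE(2),  // the Grid Agent did not respond
    UNKNOWN_COMMAND(64);     // the command name was not recognized

    final int exitCode;

    Outcome(int exitCode) {
        this.exitCode = exitCode;
    }
}

final class TriggerPolicy {
    private TriggerPolicy() {}

    // Decides what a trigger command does, given what is known about the instance.
    static Outcome decide(String command, boolean agentResponded, boolean processExists) {
        boolean isStart = "trigger-start".equals(command);
        boolean isStopOrKill = "trigger-stop".equals(command) || "trigger-kill".equals(command);
        if (!isStart && !isStopOrKill) return Outcome.UNKNOWN_COMMAND;
        if (!agentResponded) return Outcome.AGENT_NOT_AVAILABLE;
        if (isStart) return processExists ? Outcome.IGNORED : Outcome.SENT;
        return processExists ? Outcome.SENT : Outcome.IGNORED;
    }
}
```

A CLI entry point would end with something like System.exit(outcome.exitCode), so that a cron job or wrapper script can branch on the result.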

...

Included features are:

  1. Hooks are integration points to include extra activities we want to be performed when some events occur on each Tomcat Instance or Grid Agent. A hook program is linked to a hook and may be implemented as a shell script or any other kind of executable program. Hooks can be defined for the following events:
    • pre-trigger-start
    • post-trigger-start
    • pre-trigger-stop
    • post-trigger-stop
    • pre-trigger-kill
    • post-trigger-kill
  2. The hooks are only executed when the corresponding signal is not ignored. For example, if a trigger-start is issued and the instance is stopped, the corresponding pre-trigger-start and post-trigger-start hooks are executed. If the instance was running, then the command would be ignored and its hooks would also be skipped.
  3. Hooks can be useful for many purposes. For example, typical uses are:
    • Prepare an instance configuration.
    • Record instance events.
    • Send emails or other notifications upon restarts.
    • Clear caches & temp dirs before starting an instance.
    • Delay the start of an instance to allow the OS to reclaim resources.
    • Generate thread dumps on specific events.
  4. Hook programs run on the machine where the affected instance runs. Therefore, the hook programs need to be copied and prepared (manually or automagically) to be executed on all machines of the grid.
  5. When hook programs are registered (maybe uploaded) on the Grid, they are automatically distributed behind the scenes to all instances/machines before they are ready to be used.
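The rule that hooks run only when the trigger command is not ignored could look like the following Java sketch. HookRunner and HookedTrigger are hypothetical names, and running the post hook right after the signal is delivered (rather than after the instance has changed state) is an assumption based on the "fire and forget" nature of trigger commands.

```java
// Runs the program linked to a hook. How the program is launched is left open here.
interface HookRunner {
    void run(String hookName, String service);
}

final class HookedTrigger {
    private final HookRunner hooks;

    HookedTrigger(HookRunner hooks) {
        this.hooks = hooks;
    }

    // Returns true when the signal was delivered, false when the command was ignored.
    boolean trigger(String command, String service, boolean processExists, Runnable signal) {
        boolean isStart = "trigger-start".equals(command);
        // trigger-start is ignored on a running instance; stop/kill on a stopped one.
        boolean ignored = isStart ? processExists : !processExists;
        if (ignored) {
            return false; // ignored commands skip their hooks as well
        }
        hooks.run("pre-" + command, service);
        signal.run();
        hooks.run("post-" + command, service);
        return true;
    }
}
```

A production HookRunner could start the registered script or executable with java.lang.ProcessBuilder on the machine where the instance runs.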

Phase 7 - Enhanced Grid Operation

Beyond the basic trigger operations, there is usually a need for more complex ones that address very common needs but are seldom formally implemented.

Included features are:

  1. Non-trigger commands are added to both the Command-Line and Web Grid Managers:
    • start: triggers a start and waits until the operation succeeds or fails
    • stop: triggers a stop and waits until the operation succeeds or fails
    • kill: triggers a kill and waits until the operation succeeds or fails
    • restart: triggers a stop, waits until it stops, triggers a start, waits until it starts
    • killstart: triggers a kill, waits until it stops, triggers a start, waits until it starts
  2. The new commands operate on both types of services (instances and agents).
  3. New hooks are added for the new commands:
    • pre-start
    • post-start
    • pre-stop
    • post-stop
    • pre-kill
    • post-kill
    • pre-restart
    • post-restart
    • pre-killstart
    • post-killstart
  4. All these new commands use the "trigger" primitives behind the scenes.
  5. The hooks for the non-trigger events are never ignored, so the hook programs are executed even if the related trigger commands are ignored.
  6. trigger-kill operations are now automatically issued for stop and restart operations, if configured, when a trigger-stop fails to succeed within the pre-configured time limit. The time limit is now optionally specified in the configuration file on a per-service basis.
  7. A restart delay (now optionally specified on a per-service basis in the configuration file) is used when restarting services: it's applied to the restart and killstart commands.
  8. The non-trigger commands show an update of the service state periodically (defaults to every 10s, and can be specified in the configuration file), and they keep working until the full operation completes.
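A non-trigger command is essentially "trigger, then poll until done or timed out". The Java sketch below illustrates the stop command with the optional escalation to trigger-kill. WaitingOperations, its parameters and the returned strings are all hypothetical, and the periodic state update shown to the user is left out for brevity.

```java
import java.util.function.BooleanSupplier;

final class WaitingOperations {
    private WaitingOperations() {}

    // Polls the condition until it holds or the time limit expires.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // time limit reached
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    // stop: sends trigger-stop, waits, and optionally escalates to trigger-kill.
    static String stop(Runnable triggerStop, Runnable triggerKill, BooleanSupplier stopped,
                       long timeoutMillis, long pollMillis, boolean killOnTimeout) {
        triggerStop.run();
        if (waitUntil(stopped, timeoutMillis, pollMillis)) {
            return "stopped";
        }
        if (!killOnTimeout) {
            return "failed";
        }
        triggerKill.run();
        return waitUntil(stopped, timeoutMillis, pollMillis) ? "killed" : "failed";
    }
}
```

The restart and killstart commands could be composed the same way: wait for the stopped state, sleep for the configured restart delay, trigger a start, and wait for the active state.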

...

  1. New commands:
    • deploy: deploys a web application (a WAR) to a specific grid instance or to all of them
    • undeploy: undeploys a web application from a specific grid instance or from all of them
  2. Through these operations, Tomcat instances will be able to run multiple web applications.
  3. The status command is revamped so it now lists all web applications (WARs) deployed on each instance.

...

  1. The following commands can now be issued on collections in addition to plain services:
    • status
    • trigger-start
    • trigger-stop
    • trigger-kill
    • start
    • stop
    • kill
    • restart
    • killstart
    • deploy
    • rollback
    • deploystop
    • undeploy
  2. Hooks are modified to provide information about the collection they are affecting.
  3. When a hook runs on a collection, it runs on the machine where the manager application (web or cli) runs, not remotely on the machine of any other instance. This is because in this case the execution is not tied to a specific instance, but to a collection.
  4. Services defined in a collection can be managed in sequential or parallel modes. For example, a restart command on a sequential collection will restart the second service only when the first one has fully completed. Once the second completes, it will restart the third one, and so on. A parallel collection would issue a restart on all services simultaneously.
  5. Collections are defined in the configuration file and are of a recursive nature: a collection can include plain services, other collections, or both. For sequential mode, each "sub-collection" is treated as a single element so it's considered fully complete when all its included services and collections complete.
  6. [To be analyzed and defined if it's useful or not] Collections editing through the Web interface. This can be useful to graphically update collections when machines/instances are added/removed.
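The recursive, sequential-or-parallel behaviour of collections described above maps naturally onto a composite structure. The Java sketch below is one possible shape under stated assumptions: GridNode, ServiceNode and CollectionNode are hypothetical names, and the operation is passed in as a blocking callback that returns once the service has fully completed it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// A plain service or a collection. apply() returns once the operation has fully completed.
interface GridNode {
    void apply(Consumer<String> operation);
}

final class ServiceNode implements GridNode {
    private final String name;

    ServiceNode(String name) {
        this.name = name;
    }

    @Override
    public void apply(Consumer<String> operation) {
        operation.accept(name);
    }
}

final class CollectionNode implements GridNode {
    private final boolean parallel;
    private final List<GridNode> members;

    CollectionNode(boolean parallel, List<? extends GridNode> members) {
        this.parallel = parallel;
        this.members = new ArrayList<GridNode>(members);
    }

    @Override
    public void apply(Consumer<String> operation) {
        if (!parallel) {
            // Sequential: each member, even a whole sub-collection, completes before the next.
            for (GridNode member : members) {
                member.apply(operation);
            }
            return;
        }
        // Parallel: start every member at once, then wait for all of them.
        List<CompletableFuture<Void>> running = new ArrayList<CompletableFuture<Void>>();
        for (GridNode member : members) {
            running.add(CompletableFuture.runAsync(() -> member.apply(operation)));
        }
        CompletableFuture.allOf(running.toArray(new CompletableFuture[0])).join();
    }
}
```

Because apply() blocks until a sub-collection has finished, a sequential parent treats each sub-collection as a single element, which matches the rule stated above.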

Phase 11 - Instance Configuration

...

  1. The provisioning operation will automate the following tasks:
    • Logging in to a machine
    • Installing the Grid Agents
    • Configuring & running the Grid Agent
    • Installing the Tomcat instances
    • Configuring the Tomcat instances
    • Configuring the environment (shell variables, other)
  2. Using the Web and Command-Line interfaces the user can provision the Grid. Typical operations can be:
    • Adding a new machine to the Grid.
    • Removing a machine from the Grid.
    • Adding a new instance to a machine.
    • Removing an instance from a machine.
  3. Once a new machine is registered, the machine's Agent is installed and executed.
  4. To deregister a machine all instances must have been removed first.
  5. If a machine is deregistered, the Agent is stopped and optionally uninstalled. Maybe we'll leave it there, so it will be easier in the future to re-provision the machine.
  6. Once a new instance is created the following operations are performed:
    • The instance is registered in the grid configuration file
    • Standard instance's directory tree is copied
    • All the extra instance configuration (libraries, JDBC data sources, etc.) is performed
    • No deployments are installed yet.
  7. To remove an instance, all deployables must have been undeployed first.
  8. Once an instance is removed:
    • The instance is removed from the configuration file and any collection that included it
    • All deployments are removed from it
    • The whole directory tree for it is removed on the remote machine
  9. The provisioning operations require remote access to the new machine, and therefore some kind of connection needs to be set up. For example, an SSH connection could be used if the user provides the user name/password credentials or if an SSH key exchange had previously been set up between the machines.

Phase 13 -

...

[To be described]

...

Additional Commands

[To be described]

...