Hi Dan,

On 20 Oct 2005, at 2:41 AM, Dan Podeanu wrote:

> Interesting topic.

Indeed, I'm moving to a different employer and was considering a
similar setup... I'm curious about a number of things:

What's the scale of the cluster you're using this setup on?
Would you be willing/able to share some of the work?
I'd be very interested to look at your setup before I start my own.

Any comments on the hardware stability of the nodes you're using?
Which make of blades are you using?

I was also wondering whether you are familiar with the work at
http://www.infrastructures.org/
Your setup has many of its characteristics.

> Objectives:
>
> 1. Low maintenance costs: maintaining and applying patches to a
> single build (Gentoo snapshots).
> 2. Low scalability overhead: scalability should be part of the
> design; it should not take more than 10 minutes per server to scale up.
> 3. Redundancy: permanent hardware failure of N-1 out of N nodes, or
> temporary failure (power off) of all nodes, should allow fast (10
> minute) recovery of all nodes in a cluster.

I read below that all nodes include configs for dhcp/tftp in order to
be able to take over from the golden (blade root) server. How do you
handle that? In case of downtime of the main blade root server, which
of the nodes gets to take over? Is that an automatic or a manual
process?

Additionally, did you test an all-node failure, and how did the master
blade root cope with the strain of all nodes booting at once? What
hardware are you using for the blade root server?

> Restrictions:
>
> 1. Single CPU architecture: I consider the cost of maintaining several
> architectures to be bigger than the cost of purchasing a single
> architecture.

Are you running a full 64-bit setup or 32-bit compatibility mode?
What are your experiences with stability in the 64-bit case? I'm
especially curious about PHP and its diverse set of external libs. I
do agree, though; any thoughts on the inevitable upgrade that's going
to show up some time in the future, when your current hardware
platform is no longer available?

> 2. Unified packages tree: I consider the cost of maintaining
> several Gentoo snapshots just to have deployed the minimum of
> packages per server assigned to a specific application (mail server,
> web server etc.) to be bigger than having a common build with all
> packages and just starting the required services (i.e. all deployed
> servers have both an MTA and Apache installed; it's just that web
> servers have Apache started, and the mail servers have it stopped
> and the MTA running instead).

Agreed, it doesn't pay off to have separate base sets for the
different types of nodes, and it's good for redundancy: if needed, a
former web server can stand in as a database server, etc.

> 3. An application that can act as a cluster with transparent
> failover (web with balancer and health checking, multiple MX
> servers, etc.)

I don't understand this restriction.

> 4. A remote storage for persistent data (like logs) helps (you will
> see why); you can modify the partitioning or hard disk configuration
> to maintain a stable filesystem on individual servers.

<snipped>

> Software:
>
> One initial server (blade root) is installed with Gentoo. On top of
> that, in a directory, another Gentoo is installed (the Gentoo
> snapshot) that will be replicated on individual servers as further
> described; all maintenance to the snapshot is done in a chroot.
>
> The Blade root runs DHCP and tftp and is able to answer PXE dhcp/tftp
> requests (for network boot) and serve an initial bootloader (grub
> 0.95 with the diskless and diskless-undi patches to allow detection
> of Broadcom NICs), along with an initial initrd filesystem.
>
> The Gentoo snapshot contains all the packages required for all
> applications (roughly 2 GB on our systems), along with dhcp/tftp and
> configs, to allow it to act as Blade root.

See my question above: is switching manual?
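
Incidentally, in case it helps others following the thread, I imagine
the PXE side of such a setup boils down to something like the stanza
below (assuming ISC dhcpd; the addresses and filenames are my own
guesses, not your actual config):

```conf
# /etc/dhcp/dhcpd.conf -- minimal PXE netboot sketch (hypothetical values)
subnet 10.0.0.0 netmask 255.255.255.0 {
    range 10.0.0.100 10.0.0.200;
    next-server 10.0.0.1;    # the blade root running tftpd
    filename "pxegrub";      # the patched grub 0.95 image served over tftp
}
```

with tftpd exporting the directory that holds the bootloader image and
the initial initrd.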
|
> In addition, the Blade root contains individual configurations for
> every individual deployed server (or, rather, only the changes to
> the standard Gentoo config, i.e. per-blade IPs, custom application
> configs, a different set of services to start at boot, etc.)

Do you use classes here (e.g. web server, database server, mail
server, caching server, etc.), or do you maintain individual setups
for each server? Which scripting language did you choose for the
config scripts, and why that one?
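
For what it's worth, the class-based variant I had in mind is little
more than layered copies: a common base, then a class layer, then a
host layer, with later layers winning. A rough sketch (every path and
class name below is made up, not from your setup):

```shell
#!/bin/sh
# Sketch of class-based config layering: base < class < host, where
# files in later layers overwrite earlier ones. All names hypothetical.
set -eu

root=$(mktemp -d)

# Sample layers: a base config plus a class layer and a host override.
mkdir -p "$root/configs/base" \
         "$root/configs/class-webserver" \
         "$root/configs/host-web01"
echo "dns 10.0.0.1" > "$root/configs/base/resolv.conf"
echo "start apache" > "$root/configs/class-webserver/services"
echo "ip 10.0.0.42" > "$root/configs/host-web01/net"

build_host_config() {
    host=$1; class=$2; out=$3
    mkdir -p "$out"
    # Copy each layer in order; later copies overwrite earlier files.
    for layer in base "class-$class" "host-$host"; do
        if [ -d "$root/configs/$layer" ]; then
            cp -R "$root/configs/$layer/." "$out/"
        fi
    done
}

build_host_config web01 webserver "$root/out/web01"
ls "$root/out/web01"
```

The nice property is that adding a node of an existing class is just
one small host directory, which seems to match your 10-minutes-per-
server goal.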
|
<booting process snipped>

I'm also curious as to what QA procedures you have in place to prevent
accidental mistakes on the blade root server. I assume you test
beforehand? On all server classes? Modifications to the third
archive with the per-server configs seem rather difficult to test.
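
For comparison, the kind of pre-deploy check I was considering myself
is a simple smoke test run against the snapshot before it is pushed to
the nodes. A rough sketch (the layout below is a stand-in I invented
so the script runs on its own, not your actual tree):

```shell
#!/bin/sh
# Rough sketch of a smoke test for a golden snapshot before deployment.
# The snapshot layout and the checks are illustrative only.
set -eu

# Stand-in snapshot so the sketch is self-contained; in reality this
# would be the real chroot directory on the blade root.
SNAPSHOT=$(mktemp -d)
mkdir -p "$SNAPSHOT/bin" "$SNAPSHOT/etc/conf.d"
cp /bin/sh "$SNAPSHOT/bin/sh"
: > "$SNAPSHOT/etc/resolv.conf"
: > "$SNAPSHOT/etc/conf.d/hostname"

smoke_test() {
    snap=$1
    # The snapshot should at least contain a working shell...
    [ -x "$snap/bin/sh" ] || { echo "no shell in snapshot"; return 1; }
    # ...and the config files every class depends on.
    for f in etc/resolv.conf etc/conf.d/hostname; do
        [ -f "$snap/$f" ] || { echo "missing $f"; return 1; }
    done
    echo "snapshot smoke test passed"
}

smoke_test "$SNAPSHOT"
```

That still leaves the per-server config archive untested, which is
exactly why I asked.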
|
> I hope this helps.

Oh, it sure did: it confirmed some ideas I was already thinking about
and gave me a real-world example that it can be done :-)

Thanks,

Ramon
--
Change what you're saying,
Don't change what you said

The Eels


--
gentoo-server@g.o mailing list