Add apex machine configuration #131

Manually merged
rarias merged 13 commits from apex into master 2025-07-15 11:40:32 +02:00
Owner

The [apex](https://dictionary.cambridge.org/es/diccionario/ingles-espanol/apex) machine will be the new login. It has a HW RAID 5 that should keep the data safe in the case of a single disk failure (it has 4 disks in good condition). The RAID 5 can also provide good speed, but I suspect the disks need to be TRIMed, which with HW RAID is... complicated. For now we get 7 kIOPS on the latency test and about 90 MB/s of bandwidth; however, the RAID controller can provide 16 Gbit/s and each disk 500 MB/s, so we should be able to reach at least 1 GB/s. In any case, the old login could only reach 12 MiB/s, so it is already an improvement.
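For reference, a rough sketch of how numbers like these can be measured with `fio` (the flags and target path here are assumptions, not the exact benchmark used):

```
# Hypothetical fio runs, not necessarily the commands behind the
# figures above. Random 4 KiB reads to estimate IOPS:
fio --name=iops --filename=/home/fio.test --size=1G --direct=1 \
    --rw=randread --bs=4k --iodepth=32 --runtime=30 --time_based

# Sequential 1 MiB reads to estimate bandwidth:
fio --name=bw --filename=/home/fio.test --size=1G --direct=1 \
    --rw=read --bs=1M --iodepth=8 --runtime=30 --time_based
```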

We are connected via a 10 Gbit/s link to the upstream switch, and we can sustain almost 2 Gbit/s of download speed from the outside world:

apex% speedtest
Retrieving speedtest.net configuration...
Testing from Consorci de Serveis Universitaris de Catalunya (84.88.53.236)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by Adamo (Barcelona) [0.15 km]: 7.245 ms
Testing download speed................................................................................
Download: 1704.75 Mbit/s
Testing upload speed......................................................................................................
Upload: 1762.36 Mbit/s

For now, we have:

  • 10 Gbit/s uplink
  • Redundant HW RAID 5 on / and /home
  • Full NAT from compute nodes, no more HTTP proxies (see the sketch below).
  • A firewall protects us from external attacks (only port 22 is open).
  • Same /home that we had before. We also have /ceph (slow but larger).
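A minimal sketch of how the NAT and firewall part could look in NixOS (the interface names are assumptions for illustration):

```
{
  # Hypothetical snippet; interface names are assumptions.
  # NAT the compute nodes out through the uplink, replacing the old
  # HTTP proxies.
  networking.nat = {
    enable = true;
    externalInterface = "eno1";      # assumed uplink interface
    internalInterfaces = [ "eno2" ]; # assumed internal interface
  };

  # Only SSH is reachable from the outside.
  networking.firewall = {
    enable = true;
    allowedTCPPorts = [ 22 ];
  };
}
```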

We still don't have direct visibility of the machines in the internal network, so to clone repositories we need help from other hosts.

The intention of this PR is only to set up the minimal parts to allow access to the compute nodes. I still need to reconfigure a lot of services, but that can be left for other PR(s).

rarias added 4 commits 2025-07-09 11:56:07 +02:00
rarias added 1 commit 2025-07-09 12:06:32 +02:00
They need to be able to log in to apex to access any other machine from
the SSF rack.
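For example, once the accounts are in place, a user could jump through apex to an internal host (the hostname here is made up):

```
# Use apex as a jump host; "node1" is a hypothetical example hostname.
ssh -J apex node1
```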
rarias force-pushed apex from 15b73eda4b to 58a64f64e0 2025-07-11 08:52:08 +02:00 Compare
rarias added 5 commits 2025-07-11 11:36:53 +02:00
We now have direct connection to them.
Allows root to read files in the NFS export, so we can directly run
`nixos-rebuild switch` from /home.
Don't wait to flush writes, as we don't care about consistency on a
crash:

> This option allows the NFS server to violate the NFS protocol and
> reply to requests before any changes made by that request have been
> committed to stable storage (e.g. disc drive).
>
> Using this option usually improves performance, but at the cost that
> an unclean server restart (i.e. a crash) can cause data to be lost or
> corrupted.
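A minimal sketch of how these two export options could look in the NixOS NFS server configuration (the exact export line is an assumption):

```
# Hypothetical exports entry. `no_root_squash` lets root on the
# clients access files as root, and `async` replies before writes
# reach stable storage, as described in the quote above.
services.nfs.server.exports = ''
  /home 10.0.40.0/24(rw,no_root_squash,async)
'';
```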
Otherwise they simply fail as IPv6 doesn't work.
rarias changed title from WIP: Add apex machine configuration to Add apex machine configuration 2025-07-11 11:56:07 +02:00
rarias requested review from arocanon 2025-07-11 11:56:15 +02:00
rarias requested review from abonerib 2025-07-11 11:56:15 +02:00
abonerib reviewed 2025-07-11 15:16:25 +02:00
m/apex/nfs.nix Outdated
@@ -0,0 +14,4 @@
# Check with `rpcinfo -p`
extraCommands = ''
# Accept NFS traffic from compute nodes but not from the outside
iptables -A nixos-fw -p tcp -s 10.0.40.0/24 --dport 111 -j nixos-fw-accept
Collaborator

We should add `networking.nftables.enable = lib.mkForce false;` in case they ever change the default to nftables.
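In context, the suggested override could look something like this inside the module:

```
{ lib, ... }: {
  # Pin the iptables backend so the raw iptables rules above keep
  # working even if the NixOS default ever switches to nftables.
  networking.nftables.enable = lib.mkForce false;
}
```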
Author
Owner

Sure!
rarias marked this conversation as resolved
m/apex/nfs.nix Outdated
@@ -0,0 +29,4 @@
iptables -A nixos-fw -p udp -s 10.0.40.0/24 --dport 20048 -j nixos-fw-accept
'';
# Flush all rules and chains on stop so it won't break on start
extraStopCommands = ''
Collaborator

Seems that the nixos service drops its chains on start, so this may not be needed unless we have different chain rules? https://github.com/NixOS/nixpkgs/blob/9807714d6944a957c2e036f84b0ff8caf9930bc0/nixos/modules/services/networking/firewall-iptables.nix#L63
Author
Owner

Okay, I can change it. Let's hope we don't break SSH.
Author
Owner

Seems to work ok.
rarias marked this conversation as resolved
Author
Owner

I will also add the host SSH configuration so users can access GitLab without any extra configuration.
rarias added 1 commit 2025-07-11 16:11:52 +02:00
Access internal hosts via apex proxy. From the compute nodes we first
open an SSH connection to apex, and then tunnel it through the HTTP
proxy with netcat.

This way we allow reaching internal GitLab repositories without
requiring the user to have credentials in the remote host, while we can
use multiple remotes to provide redundancy.
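A sketch of what such a host block could look like on the compute nodes (the host name, proxy address, and port are made up; assumes OpenBSD netcat is available on apex):

```
# Hypothetical ~/.ssh/config entry on a compute node.
Host gitlab-internal
    # First SSH to apex, then let netcat tunnel the connection
    # through the HTTP proxy with a CONNECT request.
    ProxyCommand ssh apex nc -X connect -x proxy.example.com:3128 %h %p
```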
rarias added 2 commits 2025-07-11 16:18:16 +02:00
They are not needed as they are already flushed when the firewall
starts or stops.
rarias force-pushed apex from 4e9be9a8d3 to 9e4072f0aa 2025-07-11 16:26:51 +02:00 Compare
abonerib approved these changes 2025-07-14 17:21:48 +02:00
rarias force-pushed apex from 9e4072f0aa to 9e83565977 2025-07-15 11:22:04 +02:00 Compare
rarias manually merged commit 9e83565977 into master 2025-07-15 11:40:32 +02:00

Reference: rarias/jungle#131