Hi Spiceworkers!

Our company is a:

  • SMB in the international logistics sector, about 15-30 personnel;
  • 2 sites with on-site workers during the day and off-site access after hours;
  • management has decided to host everything on-premises except for our email;
  • main workloads, in order of importance: a database server hosting our in-house ERP, a Windows file server, and a lightly used internal web server which we may extend to our customers (public facing) in a couple of years;
  • guest OSes are a mix of Linux and Windows; the host platform is Hyper-V with all workloads virtualised;
  • backups are made with Veeam. We have 3 servers altogether: production, backup, and DR, with the DR server kept at our 2nd site. Nightly backups go to both the backup and DR servers, and replicas are created at 3-hour intervals to these two servers.

Current situation

  • Our current servers are HPE ML350 Gen10 units that will be out of hardware support in the 3rd quarter of next year. Management has budget to invest, but as an SMB we are looking for long-term usage, perhaps a 7-year horizon, to minimise the effort of rethinking hardware refreshes;
  • Can't complain much about our current infrastructure; it works and has decent performance for 10K SAS HDDs. That said, SSDs (even SATA SSDs) would help us greatly;
  • Our file server hosts small but numerous PDFs/JPGs for our ERP (no videos/audio). Even factoring in growth, I have calculated it will take us years before we reach 1TB of usage (a rough projection sketch follows this list). The database server has a small footprint; our in-house ERP is not particularly optimised, and the accounting software is off-the-shelf. I can only say that the accounting software's performance is down to the software itself and nothing to do with hardware. For the file server I can't say, as I have no metrics;
  • We have no on-site IT staff; the only person who administers the IT hardware/software is myself, and I am not full time on IT. I am able to handle our simple setup even in times of disaster - we had a situation where the production server's RAID card failed, and we decided to fail production workloads over to our backup server within 5 hours (only because we had to assess whether the failover was worth it). We lost about 1 hour's worth of data. We then lost a couple of days behind the scenes bringing the production server back to a normal working state, but users never felt that part (they only felt the 1 hour of lost data). So far, this has been the only disaster in 7 years.
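As a rough sanity check on that growth claim, here is a minimal back-of-the-envelope projection in Python; the current footprint and monthly growth figures are assumed placeholders, not measured values:

```python
# Rough sketch: years until file/DB data hits a capacity target.
# Both figures below are assumed placeholders for illustration, not measurements.

current_gb = 500            # assumed current footprint on the production host
monthly_growth_gb = 0.4     # assumed growth from small PDFs/JPGs (placeholder)
target_gb = 1000            # the 1 TB mark

years_to_target = (target_gb - current_gb) / (monthly_growth_gb * 12)
print(f"~{years_to_target:.0f} years to reach {target_gb / 1000:.0f} TB at {monthly_growth_gb} GB/month")
# -> ~104 years at these placeholder rates; small documents would have to grow enormously to change this.
```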

Considerations

  1. Invest in backups/our backup strategy - especially immutable backups. Our backups saved us during the RAID card failure.
    Veeam has vendors that supply out-of-the-box immutability, and I am considering this over a Linux Hardened Repository.
    Out-of-the-box immutability pros:
  • Easy to set up, considering we have no on-site IT staff;
  • Firmware/software is covered by the supplier, so it should update easily without tinkering with Linux.

Out-of-the-box immutability Cons:

  • These vendors are not big in the country I'm in, so in the case of a hardware issue, how will we be supported?
  2. Invest in a SAN vs. continue with the existing direct-attached storage infrastructure
    SAN Pros:
  • Improved RTO, so no panicking over downtime;

SAN Cons:

  • Complexity, I have never managed such infrastructure
  • Added cost

Having said the above:

  • A reseller is offering the IBM FlashSystem 5015 (4680-2P4) with SAS SSDs at a better price point than the HPE MSA 2070 Gen 7, with a proof of concept. My concern with SAN storage is the latency of the connectivity between compute and storage for an SMB. The database and file server are critical to me.
  • I already had in mind a Boot Optimised Storage (RAID 1) solution to minimise the downtime of bringing servers back to a working state. If I went with SAN storage, I think the cost of storage drives would drop, with the added advantage of a better RTO thanks to clustering.
  • I would need at least 10Gb SFP+ capable switches to create a backbone network for the servers, so that backup/recovery/SAN storage traffic doesn't interfere with other VLANs.

Is this all worthwhile, especially when I have no background knowledge of SAN infrastructure? Sure, I may learn the GUI, but I don't know the intricate details if something were to go wrong, and in-house knowledge helps when something goes wrong. I'm just wondering whether SAN infrastructure will reduce the time I spend monitoring hardware or create more headaches. I currently don't spend much time on hardware, and I try my best to simplify the infrastructure so that I can manage it easily.

Could someone give their thoughts and advice on this, considering our background?

2 Spice ups

I would go for CTERA cloud combined with an edge filer at the main site for SMB data on endpoints, plus Acronis or a similar agent on the server side backing up to the cloud. That way you can choose, based on your needs, whether to save money with S3 storage or, even for just parts of the data, use a different kind of storage for quicker access to backed-up files.

1 Spice up

What is the total amount of storage for all the production servers?

How many VM servers? How many host servers?

Does the org have a defined RPO/RTO?

From initial description, anyone trying to sell you a SAN probably shouldn’t be trusted. I highly doubt they are trying to look after your best interests or provide you with the best value solution.

2 Spice ups

Appreciate the detailed post; sounds like you’ve built a solid and thoughtful setup already, especially considering you’re solo on IT duties and it’s not even your full-time role. Respect.

On your SAN questions, I’ll say this up front: SANs can be great for performance and redundancy, but they come with overhead: learning curve, troubleshooting, support complexity, and potentially more things to go wrong unless you’ve got someone hands-on regularly. Given your current setup works and your workloads aren’t massive, I’d think twice before going that route unless you’re planning for major growth or high uptime SLAs. Sometimes simplicity really is your friend, especially in SMB environments with minimal IT staffing.

Now, on the immutability side: the out-of-the-box options you’re considering (like Veeam-compatible appliances) are honestly good choices if you’re looking for a plug-and-play approach. You’re right that vendor support matters. If these aren’t big names in your country, support could lag when things go sideways — something to weigh seriously. That said, lesser-known brands often come with much lower price tags. It’s not uncommon to find them offering better deals than HPE or Dell, because they don’t have the same brand premium. If you’re okay with maybe a bit slower support in exchange for savings and ease-of-use, could be worth it.

As for backup storage in general, it’s good you’re already thinking long-term. Some cheap and effective cloud-based options worth looking at:

  • Backblaze B2: Dirt cheap and dead simple. Works great with Veeam.
  • Wasabi: No egress fees, S3-compatible, and decent performance.
  • QSE Group: Decentralized, quantum resilient zero trust storage with immutability; lightweight, secure, and low cost.
  • IDrive Cloud: Straightforward setup and solid performance for backups, good value.
  • Synology C2 Storage: Ideal if you already use Synology hardware, easy integration.

What is the total storage size for your VMs?
If the total storage is less than 10TB, or fits within a RAID 5 of 8 SSDs (e.g. 8x 1.92TB), I would not consider a SAN for an SMB…

  • the cost of a SAN and its accompanying appliances (10Gbps switches, etc.) is not cheap;
  • most 1U Dell or HPE servers have 8x 2.5" SSD slots; even in RAID 6 they can easily provide over 10TB of usable space;
  • leverage Veeam Backup & Replication 12.x (I hope this is the "Veeam" product you are using):
    i. you can back up to a NAS,
    ii. replicate VMs to a 2nd host from the backup data sets,
    iii. use Veeam Backup Copy to copy the backup data sets to a repository at the remote site,
    iv. also create replicas at the remote site using that backup data set,
    v. for all of the backup methods mentioned above, use incremental or reverse incremental (the latter to be removed soon).

Can I have more information?

  1. You mentioned 2 sites - where is your data center or server room(s)? Are there servers (hosts) at both sites?

  2. Are you using Veeam Backup & Replication for Hyper-V? What is your current backup design?

  3. How much storage is being used by your Server 20xx hosts with the Hyper-V role?

  4. Do you really need a SAN?
    It is a common misconception that VMs or data on a SAN are highly protected. In certain cases, such as a host suddenly dying (or being powered off), you can still end up with VM or file corruption (VMs are literally just large files).

  5. Do you really need immutability (for backups and/or the backup repository)? Honestly, I would not add certain "marketing" features if you do not really need them.

For SMBs (we have several subsidiaries running small businesses)… this is what I came up with for them. We are running VMware, but the setup can still be used as a reference. If you are "upgrading" from RAID 10 SAS disks, even the "slowest" Dell read-intensive SSDs in RAID 5 will work wonders… literally 100x to 500x faster.

  1. 2 Dell 1U servers with RAID 5 SSDs (8x 1.92TB) for VMs
  2. 1 Dell 1U server with RAID 5 SSDs (8x 1.92TB) as a "standby host"
  3. Veeam Backup & Replication 12.x (VBR)
  4. A few Synology DS620slim units (RAID 5, 6x 4TB SSDs) for the VBR repository
  5. 1 or 2 Dell 1U servers with RAID 5 SSDs (8x 1.92TB) as "standby hosts" at the remote site (these are old servers)
  6. A few Synology DS620slim units (RAID 5, 6x 4TB SSDs) for the "remote site" VBR repository

Currently we are using VBR with reverse incremental backups (we are looking at new methods, as reverse incremental backup is being removed from VBR).

  1. All VMs are backed up to the local Synology NAS (GFS etc. would be another topic)

  2. Critical VMs are replicated to the standby host from the backup data set (we do not replicate from the live VM; doing it this way also acts as verification that the backup can be restored to a VM).

  3. Veeam SureBackup & Veeam SureReplica run on all the backup and replication jobs. These verification jobs ensure that the backups & replicas can be restored, and checks such as file integrity, networking & heartbeats are performed. AV scans during the verification process are optional.

  4. Veeam Backup Copy is used to copy the backup data sets to the remote site (the 1st backup is always a full backup, so it will take more time)

  5. VBR replicates critical VMs to the servers at the remote site using the backup data on the remote-site NAS

So you end up with backups, replicas created from the backup data sets, remote-site backups, and replicas of the VMs at the remote site as well.

Depending on how long it takes to create the synthetic full backup and the replica, you can create (or overwrite) as many replicas as you like, as long as there is enough time for the previous jobs to complete.
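As a rough way to gauge how many of those cycles fit into a working day, a quick sketch (all job durations below are assumed examples; the real numbers come from your own VBR job history):

```python
# Rough sketch: how many backup + replica cycles fit into a daily window.
# All durations are assumed example values; measure your own VBR job times.

window_h = 6.0                # e.g. replicas allowed between 09:00 and 15:00
backup_job_h = 1.0            # assumed time for the backup / synthetic full job
replica_from_backup_h = 0.5   # assumed time to create or overwrite one replica from the backup

cycles = int(window_h // (backup_job_h + replica_from_backup_h))
print(f"{cycles} backup + replica cycles fit in a {window_h:.0f} h window")  # -> 4
```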

Thanks for the replies. Let me answer your questions all in one go.

  1. Our current setup is 1 production, 1 backup, and 1 disaster recovery server (3 physical hosts)

  2. Current storage on the Production host is 0.5TB. This number could grow because we have just started using our ERP and are doing away with a lot of paper, relying heavily on paperless processes. Calculations based on 2 months' worth of data (although I don't think that is 100% representative, since users are still adapting and transitioning) tell me it would take 100 years to reach 1TB. I don't think we'll grow tremendously since we have no video or audio; the files are numerous small PDFs/JPGs. I know we will never need 24 bays of SSDs.

  3. On that reliance on paperless: the only reason I considered a SAN was uptime, not performance. I didn't want users to get comfortable with the system, suffer a major disaster, and then have nobody trust the system any more. In our industry and our market I can't say we have a large transaction volume compared with the big players; our issue is random transactions, where we cannot predict when we will need the system. People just want something that works - all the time.

  4. RPO/RTO. About 6 years ago, the vendor who proposed this architecture proposed an RPO/RTO of 24 hours/16 hours. However, we have not discussed RPO/RTO with management since then. Because of the emergence of our ERP, a lot of our transactions and the way we process our jobs are converging on this system, which makes it even more important, so we do need to reassess our RPO/RTO. In the live situation mentioned in my first post (the RAID card failure), I achieved an RPO of 1 hour (with a bit of luck, since I have replicas every 3 hours and the RAID card happened to fail 1 hour after a successful replica) and an RTO of 5 hours (much of which was time spent assessing what we should do given the circumstances). Recovering from the replica itself took less than an hour, and I did the recovery on my own. But I was without my Production server for close to 2 weeks because hardware support couldn't source a replacement quickly, and I had to rebuild the host OS and move everything back to Production. I had no backups for that 2-week period, as I don't have a Veeam license to back up my Backup server to my Disaster Recovery server - so that was daunting.

  5. At my Production site I have a server room with redundant split air-conditioning, housing the Production and Backup servers. At my Disaster Recovery site I only have a locked server rack, which at times gets hot and dusty, with the one Disaster Recovery server.

  6. I use Veeam 12.2.

  • Veeam is installed on the host OS, which is kept off the domain
  • The Backup server makes daily forward incremental backups to its own internally attached storage; SureBackup runs to verify these backups
  • The Backup server sends a daily copy of the backups to the Disaster Recovery server
  • The Backup server creates replicas every three hours between 0900 and 1500 to its own internally attached storage
  • The Disaster Recovery server creates replicas every three hours between 0900 and 1500 to its own internally attached storage
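Given that replica schedule, a rough sketch of the data-loss window it implies (the job duration and overnight gap are assumed example values; the 1-hour loss during the RAID card failure was simply lucky timing):

```python
# Rough sketch: data-loss window (RPO) implied by a fixed replica schedule.
# Replica job duration and overnight gap are assumed example values.

replica_interval_h = 3.0   # replicas run every 3 hours between 09:00 and 15:00
replica_job_h = 0.5        # assumed time for one replica job to complete

worst_case_day_rpo_h = replica_interval_h + replica_job_h  # failure just before the next replica lands
overnight_gap_h = 18.0                                     # assumed gap from the 15:00 replica to the next morning's run

print(f"Worst case during the day: ~{worst_case_day_rpo_h:.1f} h of lost data")
print(f"Worst case overnight:      ~{overnight_gap_h:.0f} h gap, covered only by the nightly backup")
```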

I have also tested, in production, running a replica from the Disaster Recovery server, so I know the above backup/replication works in a live situation. But it is a little stressful when everybody is pressing you about why this and that isn't working and when it will be working again.

I didn’t know SureReplica works for Hyper-V.

Closing remarks
We have a budget to invest in IT; however, I am still not convinced that a SAN is in our best interest. Here is why:

  1. I have no experience with SANs. In my experience it is best to have some in-house knowledge in case something goes wrong, rather than relying entirely on a third-party vendor to come in and support it, which at times can be really frustrating.
  2. I like to keep things simple for quick troubleshooting and ease of management. I don't think a SAN will allow for this.
  3. To get performance similar to on-board direct-attached storage, I would need to invest heavily in a SAN, whereas moving from RAID 5 10K SAS HDDs to RAID 10 SATA SSDs would give us higher redundancy and faster performance at a fraction of the cost.

I’ve used SANs for multiple environments. It only makes sense when you have multiple hosts within a production site. You have a single production host, single backup host, and single DR host, correct?

If you didn’t need a SAN from the start, you won’t need it now unless you want/need multiple hosts with shared storage, but it doesn’t sound like you do.

Just my opinion on that part of your question.

But a SAN has nothing to do with your production setup or your backup setup, right?

Your servers are maybe using 1.2TB (if based on RAID 10 with 1 hot spare)?
Furthermore, are you really saying that all of your VMs only use 500GB? Even a Windows Server OS (2012 R2 or 2022) would almost take up 200GB? Maybe, like me, you already have almost 90% of your workload on cloud or SaaS (eHE, eFinance, Gmail or MS365, eERP)?

In real life, I am not sure you even need to run hypervisors… the Server 20xx with the Hyper-V role literally takes up ~30% of your entire data center storage: 250GB for the hypervisor while the VM(s) use 500GB?

Getting servers with RI SSDs (RAID 1, 2x 1.92TB with 1 hot spare) could easily satisfy your IOPS needs for the next 2 decades, especially moving off 10K RPM SAS drives.

I think you and the company leadership need to make a decision on how much business risk is acceptable. That will give clarity to which decision you choose. In my opinion, if you’re fully committed to on-prem, co-location is something you should strongly consider, or at the very least something like Azure Site Recovery. This will give you a fall-back if you’re trying to eliminate the types of situations you described where you had to spend two days working on restoring things. I would really feel uncomfortable about hosting my own web applications on-prem from a security and performance standpoint without having the proper infrastructure.

For your specific question about the server/storage upgrades, I'd suggest looking at hyperconverged solutions over a traditional SAN. Given that IT is not 100% of your job description, you could really benefit from hyperconverged infrastructure, which is very flexible, easy to configure, and allows you to scale up and down easily by just adding/removing nodes. I've never used them myself, but I've heard good things about Scale Computing from people in similar situations to yours. The admin portal looks easy to use from what I've seen in the demos. Even traditional vendors like Dell have hyperconverged offerings as well, which I'd recommend you consider.

[Image: screenshot of the production host's storage configuration]

We’re running on RAID 5 with SAS HDD 10K. Image above is our host production. Only our emails are hosted by Gmail. Everything else is on-premises.

And why not RAID10?

I'm not sure co-location would help us. It's not really the environment that we can't support; it's when disaster strikes suddenly, and that's mainly to do with the hardware/software itself. So far, it has only happened once.

Definitely, let me look into it.

You are looking at 2TB of total storage at most?

  • 250GB for the Server 20xx with the Hyper-V role?
  • As you said, only 500GB of data is actively used (add 250GB for the OS)
  • Then what about Domain Controllers? The OS only requires 250GB each at most (I recommend at least 2 DCs per network)
  • And the Veeam Backup & Replication server? (literally 250GB for the OS and 100GB for the VBR cache)

But when you say storage & backup, where are your pain points?
I have described in some detail how we do it for our smaller subsidiaries, which only actually need 1 Dell 1U server to run a few VMs (DCs, sales servers, HR servers, etc.), so we use:

  • HQ: 2 servers (1 production & 1 standby)
  • HQ: NAS for the VBR repository
  • DR site: 1 or 2 servers (used as replica hosts & recovery landing), usually the old servers from HQ
  • DR site: NAS for the VBR repository
    Note: we use Synology NAS units like the DS620slim with 6x 4TB SSDs

If you move to read-intensive SSDs (approx. 16,000 IOPS in RAID 5) from SAS HDDs (approx. 300-500 IOPS in RAID 10), the speed increase is almost too large to describe. RAID 1 SSDs would give you the maximum IOPS (approx. 26,000).

Speed aside, although RAID 10 tolerates 2 media failures, it is not a "true 2-failure" tolerance compared with RAID 6.
In your case (for the NAS or backup storage), I would choose 6x 1TB SSDs over 3x 2TB SSDs (in RAID 5), as you then only "waste" one smaller SSD for redundancy, or can even add a hot spare.
Hot spares can be important: if there is an SSD issue, you can start the rebuild remotely (if the NAS does not have auto-rebuild). By the time you drive back to HQ, the rebuild would be completed, or at least 20-50% done.

The figures above are loosely based on 4 HDDs or 4 SSDs with a 50% read / 50% write mix.
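For anyone who wants to reproduce those rough figures, the usual back-of-the-envelope estimate is effective IOPS = (drives x per-drive IOPS) / (read fraction + write fraction x write penalty). The per-drive numbers below are assumed ballpark values, not benchmarks:

```python
# Rough sketch of the usual RAID small-block IOPS estimate:
#   effective = (n_drives * per_drive_iops) / (read_frac + write_frac * write_penalty)
# Per-drive IOPS are assumed ballpark values, 50/50 read/write mix as above.

def effective_iops(n_drives, per_drive, write_penalty, read_frac=0.5):
    write_frac = 1.0 - read_frac
    return (n_drives * per_drive) / (read_frac + write_frac * write_penalty)

hdd_10k = 140      # assumed ~140 IOPS per 10K SAS HDD
ri_ssd = 10_000    # assumed ~10k mixed IOPS per SATA read-intensive SSD

print(f"4x 10K HDD, RAID 10 (penalty 2): ~{effective_iops(4, hdd_10k, 2):.0f} IOPS")  # ~373
print(f"4x RI SSD,  RAID 5  (penalty 4): ~{effective_iops(4, ri_ssd, 4):.0f} IOPS")   # ~16,000
print(f"4x RI SSD,  RAID 1/10 (penalty 2): ~{effective_iops(4, ri_ssd, 2):.0f} IOPS") # ~26,700
```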

Yes, I think 2TB in total should be sufficient for the next 7 years. I am planning to split the host OS and the underlying VMs onto separate RAID cards. Looking back at the disaster I experienced, depending on which RAID card had failed, we could either have kept our VM data and just rebuilt the host OS, or, if the VM RAID card had failed, restored from the Veeam backups/replicas. Do you think this idea is appropriate?

  • Host OS on HPE NS204i‑u Gen11 NVMe Hot Plug Boot Optimized card; and
  • VMs on HPE MR408i-o Gen11

If the above is correct, the 2TB will be used entirely for data (not the host OS), which should be plenty.

I have a domain controller as a VM on the Production server, one on the Backup server, and one on the Disaster Recovery server (in a different LAN segment, but I have configured DHCP failover from our data centre). So in total I have 3 DCs. I don't create replicas of the Domain Controllers, only backups.

Based on my experience with our setup, one of the design pitfalls is that our Backup server and Disaster Recovery server are under-provisioned. Rightfully, they should have a buffer over and above the Production server, because they run the Veeam Backup & Replication server. So the resource consumption on these two servers would be greater than on the Production server if I had to move production workloads onto them. Am I correct?

If you look at the server specifications I provided, I am 16GB short on RAM in the Backup and Disaster Recovery servers compared with the Production server. Rightfully, I think I should have 32GB more than the Production server, and I should also over-provision storage space to account for the Veeam replicas.

This is exactly how we do it too, except that I have no external VBR repository. Backups are stored on the internal storage of the Veeam backup servers in both the Backup and Disaster Recovery servers. This is why I wanted external backup storage with immutability at the production site.

If I go the route of purchasing a commodity server to act as a Veeam Hardened Repository for the on-site VBR repository, should I consider a 10/25Gb network just for backups, so that backing up to and restoring from this repository runs over that 10/25Gb network rather than traversing my local client/data network and firewall (all routing is currently performed on the firewall)?

If the above makes sense, is 10Gb sufficient? How long would it take to recover 2TB of data over a 10Gb link on the same VLAN?
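As a rough sanity check on that question, wire-speed math alone suggests 10Gb is more than enough for 2TB; in practice the restore is usually limited by the target disks and Veeam processing rather than the link. A minimal sketch, with an assumed efficiency factor:

```python
# Rough sketch: best-case time to move 2 TB over a dedicated link.
# Real restores are usually slower (target disk speed, Veeam processing, protocol overhead).

def transfer_hours(data_tb, link_gbps, efficiency=0.7):
    # efficiency is an assumed fudge factor for protocol/processing overhead
    bits = data_tb * 1e12 * 8
    return bits / (link_gbps * 1e9 * efficiency) / 3600

for gbps in (1, 10, 25):
    print(f"2 TB over {gbps:>2} Gbps: ~{transfer_hours(2, gbps):.1f} h")
# -> roughly 6.3 h at 1 Gbps, 0.6 h at 10 Gbps, 0.25 h at 25 Gbps with ~70% efficiency
```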

One other issue with brands like HPE is that their SSDs are quite costly. On the other hand, if I go with non-HPE/Dell suppliers, I don't get the hardware support - and for a one-man show, supporting it could be daunting if some hardware failed. That's why I was weighing the pros and cons of Object First versus a commodity HPE server that we build out on our own.

Because we have some budget, and because management knows I am supporting this on my own without a team, we don't mind spending a little more on support or on a non-cost-optimised server/solution. For example, we currently run the Production, Backup and Disaster Recovery servers on the same model and generation of hardware. I learn one platform and can easily replace power supplies or RAM because I am familiar with it. So if we go down this design path, I would like to keep the server models identical just so that they are easier to manage. Also, if we purchase on-site hot/cold spares, we have the luxury of using those spares at either location.

I checked RAID performance on this website.
If I were to use:

  • HPE MR408i-o Gen11 P58335-B21 (this controller's maximum specification is 3,000,000 IOPS random read and 240,000 IOPS random write)
  • HPE 1.92TB SATA 6G Mixed Use SFF SSD P40504-B21

In RAID 1, I'd get 64,000 IOPS (about 162 times more than the current setup).

With the current RAID 5 10K SAS setup, I get 394 IOPS.

I can say two things are critical for I/O: the database server and our file server.

I like this idea. I read a little about it on Spiceworks and Reddit, and the current consensus seems to be not to run hot spares… I can't fully understand why.

Once I get some ideas on the above design, I'll write up a BOM; I hope to get feedback on it.

I think you are overthinking things… especially for 2TB worth of data.

  1. I am using all Dell servers with 5-year NBD support (we usually extend to 7 years). I would only consider 4-hour response for the production host; the standby and DR hosts get NBD (next business day) only.

  2. I have not seen Dell servers with RAID adapter issues in the past 22 years, apart from potential RAID adapter battery issues (and we change the battery before it fails). That is what VMware ESXi with vCenter, or even Server 20xx with the Hyper-V role, "live migration" (or VMware vMotion) is for.

  3. Domain controllers are active on all the hosts (including the DR site), as the rule is "never recover DCs". So what you are doing is almost perfect, but do keep a VBR backup of just one DC, in case…

  4. To me, the top failures are SSDs (or HDDs), then PSUs.

  5. NEVER store backups on the host. NAS units are cheap - like the Synology DS620slim, which you can even carry around.

  6. You may not need immutable backups, but encrypted, password-protected backups are highly recommended.

  7. One of the most powerful tools in VBR is the ability to create a replica from the VBR backup data instead of from the live VM. This proves that:

  • the backup was successful;
  • the backup data can be used to create a replica of the VM;
  • the backup data set can be copied to the DR-site NAS (using Veeam Backup Copy);
  • the backup data at the DR site can be used to create a replica of the VM on the DR-site host.
    So now you have a replica of each VM (except the DCs) on the 2nd host and on the DR host.
  1. IOPS is not like MB per second; it is the number of read or write operations per second. Once you go from HDDs to SSDs, the performance difference is too large to meaningfully compare - that is why the bus speeds are 6Gbps or 12Gbps. But if you are really just moving 2TB from HDDs to SSDs… let's not overthink it.
    Also, just use SATA read-intensive SSDs for your case; see further below for more reasons.

  2. RAM is relatively cheap compared with the price of MS Server licenses (or the server as a whole). I always assume that each host needs to be able to run all the workloads (in case of hardware issues and the need to reboot or take a host down for a few hours or days):

  • 16GB for a DC (some say 8GB is enough)
  • 16GB for the VBR server (I would put a VBR server on the standby host & the DR host, but the DR host one is only for controlling the NAS and acting as a proxy for jobs; all jobs run at HQ)
  • 16GB or 32GB for the file server (depending on the number of users and AV)
  • 32GB or 64GB for the ERP (it really depends)
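Summing those per-VM figures gives a quick way to size a standby/DR host so it can carry the whole production load at once; the host overhead below is an assumption, and the per-VM numbers are just the ballpark figures above:

```python
# Rough sketch: RAM a standby/DR host needs to carry every workload at once.
# Per-VM figures are the ballpark numbers above; the host overhead is an assumption.

vm_ram_gb = {
    "domain controller": 16,
    "VBR server": 16,
    "file server": 32,
    "ERP": 64,
}
host_overhead_gb = 8  # assumed for the Hyper-V parent partition / management OS

total = sum(vm_ram_gb.values()) + host_overhead_gb
print(f"Standby host should have at least ~{total} GB RAM")  # ~136 GB -> round up to the next DIMM configuration
```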

You sometimes need to read the full story:
Are they talking about a NAS with 6 slots, a 1U server with 8 slots, or a SAN with 24 slots? Do they already need to use all the slots?
Do they need to chase certain speeds (say a large DB over 10TB while running 20 VMs) with RAID 10 because they cannot afford SAS write-intensive SSDs?

For your case, where 2TB of "raw space" can last you 7 years… you may want to have 3TB just in case (optional).
Your 1U server should have 8 SSD slots, so you have the options of:

  • RAID 1 using 2x 1.92TB SSDs
  • RAID 1 using 2x 1.92TB SSDs with 1 hot spare
  • RAID 5 using 3x 1TB SSDs
  • RAID 5 using 3x 1TB SSDs with 1 hot spare
  • RAID 6 using 4x 1TB SSDs
  • RAID 6 using 4x 1TB SSDs with 1 hot spare
  • RAID 10 using 4x 1TB SSDs
  • RAID 10 using 4x 1TB SSDs with 1 hot spare
    For your case, the cost difference between 1TB SSDs (960GB) and 2TB SSDs (1.92TB) is fairly negligible, as the gap is small.
    It is only when you step up to 4TB or 8TB SSDs that you feel the pain: each 4TB or 8TB SSD can cost around $4.1k or $7.7k apiece.
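To compare those layouts on usable capacity, a quick sketch (using 1.92TB and 0.96TB as the nominal capacities of the "2TB" and "1TB" drives; a hot spare adds a drive to the bay count without changing usable space):

```python
# Rough sketch: usable capacity of the layout options above (nominal drive sizes).
# RAID 1/10 keep half the drives, RAID 5 loses one drive, RAID 6 loses two.

def usable_tb(n, size_tb, level):
    return {"1": n / 2, "10": n / 2, "5": n - 1, "6": n - 2}[level] * size_tb

options = [
    ("RAID 1,  2x 1.92TB", 2, 1.92, "1"),
    ("RAID 5,  3x 0.96TB", 3, 0.96, "5"),
    ("RAID 6,  4x 0.96TB", 4, 0.96, "6"),
    ("RAID 10, 4x 0.96TB", 4, 0.96, "10"),
]
for name, n, size, level in options:
    print(f"{name}: ~{usable_tb(n, size, level):.2f} TB usable")
# All four land around ~1.9 TB usable, so the choice is mostly about rebuild behaviour and spare bays.
```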

Then, are they also comparing hot spares vs. cold spares, where you buy extra SSDs and keep them outside the server?
For example, I have 5 SAN appliances, each with 24 slots of 4TB SSDs, which can leverage a "single universal hot spare". So it makes sense for me to buy 2 or 3 spare SSDs from the manufacturer so I can swap faulty ones out immediately when a rebuild begins, with no need to wait up to 4 hours for Dell to deliver a replacement.
These issues are not apparent when an appliance is new, but maybe after 36 months, or over a weekend, or during COVID…

I would also recommend an Intel Xeon with 14 or 16 cores; it is important not to exceed 16 cores per server because of MS Server OS licensing.
On the Dell site for the R450, the difference between the Xeon 4310 (12 cores) and the 4314 (16 cores) is only about $122.
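The licensing point is simple arithmetic: Windows Server is licensed per core, with a minimum of 8 core licenses per processor and 16 per server, so anything beyond 16 cores in a host adds licence cost. A minimal sketch, with an assumed pack price:

```python
# Rough sketch: Windows Server per-core licensing (min 8 cores per processor, 16 per server).
# The 2-core pack price is an assumed placeholder, not a quote.

def cores_to_license(cores_per_cpu, n_cpus=1):
    per_cpu = max(8, cores_per_cpu)
    return max(16, per_cpu * n_cpus)

price_per_2core_pack = 250  # assumed placeholder price

for cores in (12, 16, 24, 32):
    licensed = cores_to_license(cores)
    print(f"{cores}-core single-CPU host: license {licensed} cores (~${licensed // 2 * price_per_2core_pack})")
# 12- and 16-core hosts license the same 16 cores; 24 and 32 cores cost proportionally more.
```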

It sounds like you've done a great deal of research and weighed the different technical design options. Given what you noted about having no background in SAN hardware, I would highly suggest considering a SaaS cloud offering for backup and recovery. For the size of your environment, this is a huge capital investment that could be a fraction of the cost if you move to OpEx. I'd love to have a discussion about our offering from Redstor and how we support our clients.