Surprising fact: more than 70% of Malaysian data centers report that storage design—more than server count—determines app performance during peak demand.
We set out to simplify a complex choice. This piece compares two leading approaches to shared storage and shows where each excels.
We examine the head-to-head between vSphere-native vSAN and open-source Ceph, looking at real effects on performance, risk, and scalability for local businesses.
What matters: policy-driven controls, unified management, network design, and hardware needs all shape day‑2 operations and SLAs.
Our aim is pragmatic: translate technical nuance into business guidance so Malaysian teams can align investments to outcomes without added complexity.
Key Takeaways
- Align choice to workloads: pick the storage approach that matches your virtualization and cloud strategy.
- Performance is holistic: network, caching, and metadata placement matter as much as raw IOPS.
- Plan for skills: open‑source options offer flexibility but demand operational expertise.
- Keep procurement simple: reliable, predictable hardware and clear SLAs reduce operational risk.
- Scale with intent: the right design reduces long‑term complexity and supports business continuity.
Overview: Why “Ceph vs vSAN” matters in 2025 for Malaysia’s data centers
In 2025, storage choices shape how Malaysian organisations handle growth, cost, and resilience.
We see a clear shift from legacy NAS/SAN toward software-defined storage to reduce hardware lock‑in and improve scalability. This change affects how teams budget for capex and opex, and how they measure performance against business SLAs.
Procurement now balances licensing, support models, and people‑costs so that IT leaders can map costs to outcomes. The right pick also aligns with regulatory and environmental rules—data locality, backup policy, and continuity plans matter.
Network design drives real-world results. Modern clusters rely on 10/25/100GbE links, consistent jumbo frames, and enough east-west bandwidth to meet throughput and latency targets.
Operational fit is decisive: vSAN often wins in VM-first vSphere estates for native management, while Ceph suits mixed block, file, and object needs across multiple environments.
- Scalability maps to business growth: add nodes for capacity and throughput without forklift upgrades.
- Choose by TCO, skills, support, and performance per ringgit—this evaluation framework reduces retrofit risk and speeds hybrid cloud or AI projects.
What is Ceph? Open-source, software-defined, unified storage
We describe a unified, open storage system that serves block, file, and object workloads from a single cluster. This system removes silos and lets teams manage all data types in one pool.
Core components and how they work
MON holds cluster state and maps. OSD daemons store and replicate data, usually one daemon per device, and each needs roughly 4GB of RAM. MDS handles metadata for CephFS file namespaces. MGR provides monitoring and management hooks.
Storage types and resilience
The platform supports RBD for block, CephFS for file, and RGW for S3-compatible object access. The CRUSH map places data across failure domains, enabling self-healing and fault tolerance without central bottlenecks.
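To make the placement idea concrete, here is a minimal Python sketch of failure-domain-aware placement. It is not the CRUSH algorithm itself: it swaps in simple rendezvous (highest-random-weight) hashing, and the rack names, OSD ids, and the `place` function are hypothetical. It only illustrates the property that matters operationally: each replica lands in a different rack, and any client can compute the location without asking a central service.

```python
import hashlib
from typing import Dict, List, Tuple


def _score(obj_name: str, osd_id: str) -> int:
    """Deterministic pseudo-random score for an (object, OSD) pair."""
    digest = hashlib.sha256(f"{obj_name}:{osd_id}".encode()).hexdigest()
    return int(digest, 16)


def place(obj_name: str, racks: Dict[str, List[str]], replicas: int = 3) -> List[Tuple[str, str]]:
    """Pick one OSD in each of `replicas` distinct racks (toy model, not real CRUSH)."""
    populated = {rack: osds for rack, osds in racks.items() if osds}
    if replicas > len(populated):
        raise ValueError("not enough populated racks for the requested replicas")
    # Best-scoring OSD inside each rack.
    winners = {
        rack: max(osds, key=lambda osd: _score(obj_name, osd))
        for rack, osds in populated.items()
    }
    # Rank racks by their winner's score and keep the top `replicas`.
    ranked = sorted(winners.items(), key=lambda item: _score(obj_name, item[1]), reverse=True)
    return ranked[:replicas]


# Hypothetical topology: four racks with two OSDs each.
TOPOLOGY = {
    "rack-a": ["osd.0", "osd.1"],
    "rack-b": ["osd.2", "osd.3"],
    "rack-c": ["osd.4", "osd.5"],
    "rack-d": ["osd.6", "osd.7"],
}

if __name__ == "__main__":
    for rack, osd in place("vm-disk-001", TOPOLOGY):
        print(rack, osd)
```

Because the result depends only on the object name and the topology, losing a rack changes placement only for objects that had a replica there, which is the intuition behind self-healing without a central bottleneck.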
| Feature | Why it matters | Operational baseline |
|---|---|---|
| Unified services | Run block, file, object from one cluster | 3+ nodes; per-pool policies |
| Self-healing | Automatic recovery after drive/node failures | CRUSH maps; replication or erasure coding |
| Performance tiers | NVMe for hot data, HDDs for capacity | 10GbE+; dedicated SSDs for journals and metadata |
Management matters: pool design, monitoring, and right-sized OSDs keep performance predictable. For a practical deployment checklist and related guidance, see our Proxmox VE guide.
What is vSAN? VMware’s native virtual SAN tightly integrated with vSphere
We explain how vSAN turns local host disks into a shared datastore managed from vSphere. The technology is built into the hypervisor and aggregates disk groups across ESXi hosts. This delivers a policy-driven storage layer for virtual machines.
Policy-driven controls translate business SLAs into technical rules. Storage policies cover FTT, RAID levels, deduplication, and compression so each VM gets the right protection and efficiency.
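As a rough illustration of how a policy choice turns into hardware, the Python sketch below maps an FTT and RAID setting to a minimum host count and a raw-capacity multiplier. The figures are the commonly cited ones for the disk-group based architecture (mirroring needs 2×FTT+1 hosts and FTT+1 copies; RAID-5 is a 3+1 layout, RAID-6 a 4+2 layout). Treat them as assumptions to verify against VMware's documentation for your release; `policy_cost` and `PolicyCost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyCost:
    min_hosts: int               # hosts needed to place all components
    capacity_multiplier: float   # raw capacity consumed per unit of VM data


def policy_cost(ftt: int, raid: str) -> PolicyCost:
    """Translate an FTT/RAID policy into host and capacity requirements (assumed figures)."""
    if raid == "RAID-1":
        if ftt not in (1, 2, 3):
            raise ValueError("mirroring is modelled here for FTT 1 to 3")
        return PolicyCost(min_hosts=2 * ftt + 1, capacity_multiplier=float(ftt + 1))
    if raid == "RAID-5" and ftt == 1:
        return PolicyCost(min_hosts=4, capacity_multiplier=4 / 3)   # 3 data + 1 parity
    if raid == "RAID-6" and ftt == 2:
        return PolicyCost(min_hosts=6, capacity_multiplier=1.5)     # 4 data + 2 parity
    raise ValueError(f"unsupported combination: FTT={ftt}, {raid}")


if __name__ == "__main__":
    for ftt, raid in [(1, "RAID-1"), (1, "RAID-5"), (2, "RAID-6")]:
        print(ftt, raid, policy_cost(ftt, raid))
```

The point of the exercise: a stricter SLA tier is not free. Moving a workload from FTT=1 mirroring to FTT=2 mirroring raises both the host minimum and the capacity it consumes.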
How it assembles and scales
vSAN combines local SSD and HDD in disk groups on each host to form a resilient datastore. Adding ESXi hosts increases capacity and performance together—simplifying procurement for compute and storage.
Where it excels
- Low latency block storage: tight vSphere integration keeps IO paths short for predictable performance.
- Single-pane management: native monitoring and familiar workflows reduce operational overhead.
- Policy-based resilience: automated rebuilds and protection set per workload.
| Capability | Benefit | Design note |
|---|---|---|
| Policy management | Translate SLAs to storage | Define FTT, RAID, dedupe per VM |
| Scale model | Grow capacity and IOPS | Add balanced ESXi hosts with disk groups |
| Performance | Predictable block IO | Use SSD tiers and proper host sizing |
Ceph vs vSAN: Head-to-head comparison at a glance
Operational fit matters most—here we map integration, flexibility, and management differences side by side.
Integration and daily management
vSAN embeds into vSphere—provisioning, policy enforcement, and monitoring sit in the vSphere Client. Daily tasks stay inside one console, which reduces change windows and speeds troubleshooting.
Ceph operates as an external SDS platform and integrates via RBD, NFS, or object gateways. It fits Kubernetes and OpenStack as well as vSphere, but requires separate management workflows.
Flexibility and protocol support
vSAN is VMware-centric and tuned for virtualization workloads. Ceph supports block, file, and object, enabling data use across multiple environments and hybrid clouds.
Performance, policy, and operations
Performance favors tight hypervisor coupling for low latency VM IO paths. The external SDS option delivers broader scalability and protocol flexibility but needs careful configuration for predictable latency.
Who owns storage and management changes team roles and SLAs—this is a key factor in selecting the right solutions for Malaysian deployments.
| Criteria | vSAN (native) | External SDS |
|---|---|---|
| Management | Single-pane in vSphere Client | Separate tools; broader ecosystem hooks |
| Protocol support | Primarily block for VMs | Block, file, object (multi-protocol) |
| Best fit | VM-heavy, predictable latency cases | Hybrid, cloud-adjacent use cases and multi‑stack environments |
Performance and latency: Tuning for real-world VM and container workloads
Real-world latency often comes from small misconfigurations, not raw hardware limits. We focus on the knobs that deliver consistent performance for virtual machines and containers in Malaysian data centers.
CPU, RAM, NVMe/SSD, and metadata devices
Right-size CPU and RAM for storage daemons and controller threads. For distributed OSD-like services, plan ~4GB RAM per device and reserve cores for IO paths.
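The ~4GB-per-device guideline can be turned into a quick node sizing check. The sketch below is a back-of-envelope estimate only; the base OS allowance and the safety factor are our assumptions, not figures from any vendor, and `osd_node_ram_gb` is a hypothetical helper.

```python
def osd_node_ram_gb(
    osd_count: int,
    per_osd_gb: float = 4.0,      # guideline from the text above
    base_os_gb: float = 16.0,     # assumption: OS, agents, monitoring
    safety_factor: float = 1.25,  # assumption: margin for recovery spikes
) -> float:
    """Estimate RAM for a storage node that runs `osd_count` OSD daemons."""
    if osd_count < 0:
        raise ValueError("osd_count must be non-negative")
    return (osd_count * per_osd_gb + base_os_gb) * safety_factor


if __name__ == "__main__":
    # 12 drives per node: (12 * 4 + 16) * 1.25
    print(osd_node_ram_gb(12))  # 80.0
```

Adjust the defaults to your own baseline; the useful habit is sizing memory from the drive count rather than picking a round number per server.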
Use NVMe or SSD tiers for hot data and place metadata or journal devices on dedicated SSDs. This reduces latency spikes during rebuilds and peak loads.
Network design
Ensure 10/25/100GbE fabric with end-to-end MTU and jumbo frames set consistently. East‑west traffic carries most storage IO—hidden bottlenecks ruin throughput.
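A simple way to reason about end-to-end MTU is to compute the largest ping payload that fits in one unfragmented frame and to list any hop that deviates from the expected value. The Python sketch below assumes IPv4 with no options (20-byte header) and an 8-byte ICMP header; the hop names and inventory are hypothetical.

```python
from typing import Dict

IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP payload that fits in a single IPv4 packet of size `mtu`."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def mtu_mismatches(path_mtus: Dict[str, int], expected: int) -> Dict[str, int]:
    """Return every hop whose configured MTU differs from `expected`."""
    return {hop: mtu for hop, mtu in path_mtus.items() if mtu != expected}


if __name__ == "__main__":
    # Hypothetical inventory of one storage path.
    path = {"host-nic": 9000, "leaf-switch": 9000, "spine-switch": 1500, "peer-nic": 9000}
    print(max_ping_payload(9000))      # 8972
    print(mtu_mismatches(path, 9000))  # {'spine-switch': 1500}
```

The payload value is what you would pass to a don't-fragment ping test, for example `ping -M do -s 8972` on Linux or `vmkping -d -s 8972` on ESXi, to confirm that jumbo frames really pass end to end.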
Replication vs erasure coding
Replication gives lower latency and higher IOPS for transactional workloads. Erasure coding saves capacity but adds CPU and IO overhead—expect higher latency during writes and rebuilds.
- vSAN tuning: adjust FTT, RAID levels, dedupe, and compression to meet VM latency targets.
- Software-defined storage tuning: thread settings, NVMe selection, and pool layout keep performance predictable under maintenance.
- Instrument end-to-end telemetry to correlate app latency with storage and network layers.
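The capacity side of the replication versus erasure coding trade-off is plain arithmetic, sketched below in Python. Replication with `size` copies keeps 1/size of raw capacity usable; an erasure-coded layout with k data and m coding chunks keeps k/(k+m). The function names are ours, and the figures ignore free-space reserves and metadata overhead.

```python
def replication_efficiency(size: int) -> float:
    """Usable fraction of raw capacity with `size` full copies."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return 1 / size


def erasure_efficiency(k: int, m: int) -> float:
    """Usable fraction of raw capacity with k data chunks and m coding chunks."""
    if k < 1 or m < 0:
        raise ValueError("k must be >= 1 and m must be >= 0")
    return k / (k + m)


def usable_tb(raw_tb: float, efficiency: float) -> float:
    """Usable capacity before any free-space reserve is subtracted."""
    return raw_tb * efficiency


if __name__ == "__main__":
    raw = 600.0  # TB, hypothetical cluster
    print(round(usable_tb(raw, replication_efficiency(3)), 1))  # 200.0
    print(round(usable_tb(raw, erasure_efficiency(4, 2)), 1))   # 400.0
```

In this example both layouts survive two simultaneous failures (three copies lose two; 4+2 loses any two chunks), yet erasure coding doubles the usable capacity. The price is paid in the extra CPU and IO on writes and rebuilds described above.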
Scalability and growth: From three nodes to petabyte-scale clusters
Scaling a cluster well prevents surprise rebuilds and keeps latency predictable as data volumes rise. Good growth planning ties capacity to operational practices. It protects SLAs and reduces costly hot‑fix windows in Malaysian facilities.
vSAN scaling inside vSphere
In a vSphere estate, adding an ESXi host increases both compute and storage capacity in lockstep. One extra node raises IOPS and usable space while keeping management inside the vSphere console.
Keep disk groups balanced and plan RAID/FTT settings to limit rebuild times. A single mis-sized node can create hotspots — avoid uneven drive mixes.
Scale-out across racks and failure domains
At massive scale, we expand the cluster across racks and sites and use CRUSH placement maps to maintain fault tolerance. Design rack-level and room-level failure domains to contain the blast radius.
- More nodes lower peak IOPS per device and parallelize recovery, but rebuild traffic still competes with client IO; reserve headroom for rebuilds.
- Shift from replication to erasure coding when capacity efficiency outweighs write latency costs.
- Align compute and storage purchases to rack power, cooling, and density limits.
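Rebuild headroom from the list above can be estimated with two small calculations: how full a cluster may run while still being able to re-protect data after losing nodes, and roughly how long it takes to move a failed node's data. The Python sketch assumes evenly balanced data, decimal terabytes, and a sustained aggregate recovery rate; real clusters throttle recovery and keep extra free-space margins, so treat the results as lower bounds.

```python
def max_safe_fill(nodes: int, tolerated_node_failures: int = 1) -> float:
    """Highest fill fraction at which surviving nodes can still hold all data."""
    if not 0 <= tolerated_node_failures < nodes:
        raise ValueError("tolerated failures must be between 0 and nodes - 1")
    return (nodes - tolerated_node_failures) / nodes


def rebuild_hours(data_tb: float, recovery_gbps: float) -> float:
    """Hours needed to re-replicate `data_tb` at a sustained aggregate rate."""
    if recovery_gbps <= 0:
        raise ValueError("recovery_gbps must be positive")
    seconds = data_tb * 8 * 1000 / recovery_gbps  # 1 TB = 8000 gigabits
    return seconds / 3600


if __name__ == "__main__":
    print(max_safe_fill(4))       # 0.75
    print(rebuild_hours(45, 10))  # 10.0
```

A four-node cluster that must survive one node loss should therefore plan to stay below 75% full, and a node holding 45TB needs about ten hours of sustained 10Gbps recovery traffic, which is why denser nodes call for faster links.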
Operational complexity, management, and learning curve
Operational readiness often decides whether a storage project succeeds or stalls. We focus on how day‑to‑day tasks shape outcomes for Malaysian teams.
vSAN’s native workflows and monitoring
vSAN integrates into the vSphere Client so provisioning, policy changes, and monitoring stay inside one console. This reduces change windows and lowers human error.
Deployment, CRUSH tuning, and ongoing optimization
Ceph requires deeper expertise: cluster design, CRUSH maps, replication or erasure coding choices, and continuous tuning. Teams need Prometheus/Grafana or equivalent dashboards to keep performance steady.
- Day‑2 work: rolling upgrades, capacity expansions, and policy updates with minimal service impact.
- Configuration patterns: standardized node builds, version control for maps, and runbooks.
- Governance: clear roles for storage, network, and platform to reduce operational complexity.
| Area | vSAN | External system |
|---|---|---|
| Management | Single-pane | Separate tools |
| Learning curve | Moderate | Steep |
| Performance tuning | Policy-driven | Continuous |
Our advice: attach observability to SLOs, assign clear owners, and pace rollouts to keep complexity manageable while preserving flexibility for future solutions.
Cost and TCO in Malaysia: Licensing, hardware, and skills trade-offs
Total cost of ownership in Malaysia depends more on people and process than on sticker price. We weigh licensing, hardware, and operational effort so leaders can budget to outcomes.
vSAN licensing versus open-source software costs
Licensing for vSAN varies by feature set and host count. Some teams see steady value in vendor support; others are watching licensing changes that tie fees to capacity.
Open-source software shifts spend to skilled operations. Savings on license fees reappear as training, automation, and longer planning cycles.
Hardware footprints and network impact
Hardware choices drive the biggest one-time spend: nodes, SSD/NVMe tiers, RAM, and CPU. A three-node all‑flash HCI (dual sockets, 1TB RAM, 60TB per host) is a real-world case—budget circa 180k.
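To connect that example to a cost per usable terabyte, the short Python sketch below works through the arithmetic. Two things are assumptions on our side: the data is protected with two full copies (a mirroring policy), and the budget figure is treated as an abstract currency unit, since the example above does not name one.

```python
from typing import Tuple


def cluster_capacity(hosts: int, tb_per_host: float, copies: int) -> Tuple[float, float]:
    """Return (raw_tb, usable_tb) for a cluster that stores `copies` full copies."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    raw_tb = hosts * tb_per_host
    return raw_tb, raw_tb / copies


def cost_per_usable_tb(budget: float, usable_tb: float) -> float:
    """Budget divided by usable capacity, in the budget's own (unstated) currency."""
    if usable_tb <= 0:
        raise ValueError("usable_tb must be positive")
    return budget / usable_tb


if __name__ == "__main__":
    raw, usable = cluster_capacity(hosts=3, tb_per_host=60, copies=2)  # copies=2 is assumed
    print(raw, usable)                          # 180 90.0
    print(cost_per_usable_tb(180_000, usable))  # 2000.0
```

Free-space reserves, deduplication, and compression will move the real number in either direction, so use this as a starting point for comparing quotes rather than a final figure.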
Network spend (10/25/100GbE) matters for consistent performance during rebuilds and failures. Optics and switches should be in the procurement plan.
People, process, and local procurement
We recommend budgeting for training, runbooks, and support contracts. Local partner lead times and fiscal cycles shape when purchases land.
Align financial models to KPIs—map availability, performance SLOs, and growth to recurring and one-time costs.
| Cost driver | Impact | Mitigation |
|---|---|---|
| Licensing & support | Predictable ops spend vs capex shocks | Choose level of vendor support; model 3–5 year renewal |
| Hardware (nodes, SSD/NVMe) | Largest capex; affects IOPS and capacity | Right-size node configs; prefer all‑flash for performance SLOs |
| Network (switches, optics) | Affects rebuild and steady-state latency | Invest in 25/100GbE where rebuild windows matter |
| People & process | MTTR and operational risk | Train staff, create runbooks, and automate common tasks |
Use cases and environments: Matching solution to workload
Choosing storage starts with the workload: latency-sensitive virtual machines need a different path than AI pipelines. We map practical use cases to technology choices so Malaysian teams can align cost, risk, and scalability.
VM-heavy clusters, VDI, and private cloud
For VM-centric estates and VDI, vSAN offers tight policy control inside vSphere. It makes it easy to translate SLAs into storage rules and ensures consistent latency for production workloads.
When to choose this: transactional databases and desktop pools that demand predictable IOPS and short tail latency.
Kubernetes, OpenStack, big data and object workflows
For container platforms, OpenStack services, and AI/ML pipelines, a unified stack that serves block, file, and object is often a better fit. This supports S3-compatible targets for backups and large-scale data ingestion.
These cases benefit from flexible protocol support and portability across multiple environments—helpful when pipelines span cloud and on-premises clusters.
Hybrid strategies: NAS, virtual SAN, and shared object stores together
Most real deployments use a mix. Use NAS for backups and archives, virtual SAN for production VMs, and a unified block and S3-capable layer for container volumes and analytics.
Design for operational outcomes—isolate noisy workloads, set guardrails with policies, and plan capacity growth to preserve performance as data and user counts rise.
| Use case | Recommended solution | Key benefit | Performance sensitivity |
|---|---|---|---|
| VDI & transactional VMs | vSAN (policy-driven) | Predictable latency; single-pane ops | High |
| Containers, OpenStack, AI pipelines | Unified block/file/object | Protocol flexibility; portability across environments | Medium–High (depends on burst) |
| Backups & cold archives | NAS / object targets | Cost-efficient capacity; easy restore | Low |
Design and deployment best practices for each solution
Design choices made during deployment determine whether storage meets SLAs or creates repeated firefighting. We recommend a clear checklist that ties hardware, policy, and monitoring to business outcomes for Malaysian data centers.
vSAN: storage policies, host design, and vSphere best practices
Align storage configuration to SLA tiers—define FTT and RAID per workload and size disk groups consistently. Use host profiles to keep node builds identical and reduce drift.
Choose dedupe and compression per tier. Right-size disks and cache layers to stabilise performance during rebuilds.
Balanced node builds, RBD tuning, and placement
For external SDS we favour balanced CPU, RAM, and I/O. Put OSD journals (or, on BlueStore, the DB/WAL devices) on dedicated SSDs to avoid noisy neighbour effects.
Tune RBD settings for block workloads and map CRUSH to racks and rooms for clear failure domains and predictable fault tolerance.
Backups and DR: practical patterns
Backup is non-negotiable. We use Veeam to a TrueNAS target or to scalable file/object endpoints for long-term retention. Replicate offsite over dark fiber when possible.
“Standardise builds, automate config, and surface telemetry early — this lowers risk and keeps ops predictable.”
- Enforce dedicated storage VLANs and end-to-end MTU checks on the network.
- Standardise templates with Ansible/Terraform and Host Profiles for consistent management.
- Instrument with Prometheus/Grafana and vSphere alarms to catch anomalies before SLAs slip.
Network-first planning: Why SDS success depends on switching and topology
We prioritise the network because bandwidth and loss shape real storage outcomes.
Software-defined storage depends on the fabric more than on any single device. Bandwidth, latency, and packet loss determine stability and steady throughput.
Design a clean L2/L3 topology with deterministic paths. Ensure end-to-end MTU and jumbo frame consistency across switches and hosts. Inconsistent MTU is a leading cause of intermittent storage performance problems.
Choose link speeds to match growth: 10/25/100GbE are common. Watch oversubscription ratios—east‑west storage flows need high sustained bandwidth, not bursty shared links.
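The oversubscription ratio mentioned above is the total host-facing bandwidth of a switch divided by its total uplink bandwidth. A minimal Python sketch, with hypothetical port counts:

```python
def oversubscription_ratio(
    host_ports: int, host_port_gbps: float, uplinks: int, uplink_gbps: float
) -> float:
    """Downlink bandwidth divided by uplink bandwidth for one leaf switch."""
    uplink_total = uplinks * uplink_gbps
    if uplink_total <= 0:
        raise ValueError("uplink bandwidth must be positive")
    return (host_ports * host_port_gbps) / uplink_total


if __name__ == "__main__":
    # Hypothetical leaf: 48 x 25GbE host ports, 6 x 100GbE uplinks.
    print(oversubscription_ratio(48, 25, 6, 100))  # 2.0
```

A ratio of 2.0 means hosts can offer twice the traffic the uplinks can carry. What is acceptable depends on how much storage traffic stays within a rack; sustained east-west replication and rebuild flows argue for keeping the ratio low.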
Segment infrastructure: dedicate a storage network, apply QoS, and isolate background rebuild traffic from application data. This protects SLOs during maintenance and heavy jobs.
Validate before production with synthetic tests for packet loss, jitter, and buffer behaviour. Small drops escalate into tail latency, retries, and noisy neighbours—so measure proactively.
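Tail latency is easy to miss when dashboards only show averages. The Python sketch below uses the nearest-rank method to compute percentiles from a set of latency samples; the sample values are made up to show how a 2% rate of slow operations disappears in the mean but dominates the 99th percentile.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical test run: 98 fast IOs at 1 ms, 2 retried IOs at 50 ms.
    latencies_ms = [1.0] * 98 + [50.0] * 2
    print(sum(latencies_ms) / len(latencies_ms))  # 1.98
    print(percentile(latencies_ms, 50))           # 1.0
    print(percentile(latencies_ms, 99))           # 50.0
```

Reporting p99 alongside the mean during synthetic tests makes small packet-loss problems visible before they reach production workloads.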
Telemetry is non‑negotiable. Accurate interface counters, queue depths, and switch buffers shorten incident time. Align procurement to long-term capacity and acceptable failure domains—optics, cables, and switch features matter.
| Area | Recommendation | Why it matters |
|---|---|---|
| MTU & jumbo frames | Set end-to-end; test on all devices | Prevents fragmentation and intermittent latency spikes |
| Link speed & oversubscription | Use 10/25/100GbE; limit oversubscription for east‑west | Maintains throughput during heavy storage operations |
| Segmentation & QoS | Dedicated storage VLANs and QoS rules | Protects application traffic from rebuilds and backups |
Migrations and decision framework: How to choose and move with minimal risk
Migrations succeed when we treat them as staged experiments, not single big‑bang moves.
We start with a decision framework: define scope, constraints, and success criteria. This helps select the right solution for your environment and roadmap.
Greenfield, brownfield, pilots and coexistence
Greenfield lets you build a clean cluster and validate configuration from day one. Brownfield requires coexistence—run pilots and migrate in waves to reduce disruption.
Sizing for performance, fault tolerance, and growth
Size for headroom: plan rebuild windows, seasonal peaks, and growth over three years. Balance cost and resilience by choosing replication or erasure coding per workload.
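A three-year sizing estimate can be sketched as compound growth plus a reserved headroom fraction. In the Python example below the growth rate and the 25% headroom are placeholders to replace with your own measurements, and `projected_capacity_tb` is a hypothetical helper.

```python
def projected_capacity_tb(
    current_tb: float,
    annual_growth: float,     # e.g. 0.30 for 30% per year (placeholder)
    years: int = 3,
    headroom: float = 0.25,   # fraction kept free for rebuilds and peaks (placeholder)
) -> float:
    """Usable capacity to provision so that `headroom` stays free after `years` of growth."""
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    grown_tb = current_tb * (1 + annual_growth) ** years
    return grown_tb / (1 - headroom)


if __name__ == "__main__":
    # 100 TB today, 30% yearly growth, 25% kept free.
    print(round(projected_capacity_tb(100, 0.30), 1))  # 292.9
```

Dividing by `1 - headroom` rather than multiplying by `1 + headroom` ensures the reserved fraction is measured against the provisioned total, which is how fill-level alerts are normally expressed.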
- Pilot small clusters to prove performance and manageability.
- Codify configuration standards—templates for policy, placement, and monitoring.
- Sequence data moves by application with clear maintenance windows and verification checks.
- Define acceptance gates: SLOs, fault‑tolerance tests, and operability checks before cutover.
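Acceptance gates work best when they are written down as numbers and checked mechanically before cutover. The following Python sketch compares measured pilot results against agreed limits; the gate names and thresholds are purely illustrative, and every gate is assumed to be "lower is better".

```python
from typing import Dict, List


def failed_gates(measured: Dict[str, float], limits: Dict[str, float]) -> List[str]:
    """Return gates whose measurement is missing or exceeds its limit."""
    failures = []
    for gate, limit in limits.items():
        value = measured.get(gate)
        if value is None or value > limit:
            failures.append(gate)
    return failures


if __name__ == "__main__":
    # Illustrative limits and pilot results, not recommendations.
    limits = {"p99_latency_ms": 5.0, "rebuild_hours": 12.0, "failed_restore_tests": 0}
    pilot = {"p99_latency_ms": 3.8, "rebuild_hours": 14.5, "failed_restore_tests": 0}
    print(failed_gates(pilot, limits))  # ['rebuild_hours']
```

Treating a missing measurement as a failure is a deliberate choice here: a gate that was never tested should block cutover just as a failed one does.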
“Pilot, verify, rollback—then scale.”
Finally, govern the change with communication plans and post‑migration reviews so lessons become repeatable practice across your organisation.
Conclusion
Our closing view: choose the solution that fits your virtualization stack, team skills, and growth plan.
For hypervisor-native environments, a virtual SAN approach delivers tight policy control and low-latency delivery to virtual machines. For multi-protocol needs, a unified block, file, and object system gives broad capabilities and strong fault tolerance across a well-designed cluster.
Focus on fundamentals: right-size CPU and RAM, pick reliable drives and disks, and validate network design before scaling. Operational readiness — runbooks, lifecycle plans, and measured pilots — makes the architecture work in production.
In Malaysia, align procurement and talent planning to your chosen path. Both solutions succeed when you test, monitor, and scale with discipline.
FAQ
What are the core differences between the two storage platforms for virtualized environments?
The two platforms differ in design and integration. One is an open-source, software-defined system that provides block, file, and S3-compatible object interfaces and scales across racks using OSDs and monitor services. The other is a VMware-native virtual SAN tightly integrated with vSphere that presents a single resilient datastore using policy-driven management. Choice depends on ecosystem, management model, and administrative skills.
How does fault tolerance and self-healing compare when a drive or node fails?
Both offer redundancy. The open-source solution uses replication or erasure coding and a CRUSH-based data distribution to recover and rebalance automatically. The hypervisor-native system uses policy-based copies and rebuilds within the cluster to meet the configured failure tolerance. Recovery speed depends on network, disk types, and cluster size.
Which platform is better for mixed workloads — VMs, containers, and object storage?
For mixed workloads that require unified block, file, and S3 object access, the open SDS platform provides broader protocol support and native object targets. For VM-dense vSphere deployments, the hypervisor-integrated option offers simpler operational workflows and VM-focused optimizations. Many organizations use a hybrid approach to match workloads to strengths.
What hardware and network considerations most affect latency and IOPS?
CPU, RAM, and storage media (NVMe/SSD vs HDD) are primary. Metadata and journal devices also matter. Network design — 10/25/100GbE, MTU/jumbo frames, and low-latency switching for east‑west traffic — is critical. Proper tuning and sizing of cache tiers and replication parameters reduce latency and increase IOPS.
How do replication and erasure coding impact performance and capacity efficiency?
Replication offers simpler, lower-latency writes but uses more raw capacity. Erasure coding improves space efficiency and reduces storage overhead but increases compute and network load during writes and rebuilds. Use replication for hot VM disks and erasure coding for cold or capacity-optimized object workloads.
What is the typical scaling model and limits for large clusters?
One solution scales out by adding OSD-bearing nodes across racks and failure domains to reach petabyte scales with linear capacity growth. The other scales within vSphere cluster boundaries and benefits from tight host integration; scale is often constrained by cluster and licensing limits. Both require network and management attention as size grows.
How steep is the operational learning curve for each solution?
The hypervisor-native approach leverages existing vSphere workflows and monitoring, lowering operational friction for VMware teams. The open-source SDS needs deeper systems knowledge (CRUSH maps, OSD tuning, and cluster health tools) and therefore requires more skills for deployment and performance optimization.
What licensing and TCO factors should Malaysian data center planners consider?
Licensing costs favor the hypervisor-native product when existing VMware licensing and support align with needs. The open-source platform has no software license fee but demands investment in servers, networking, support subscriptions, and skilled staff. Total cost of ownership depends on hardware choice, power, cooling, and personnel.
Which solution suits VDI, desktop virtualization, and mission-critical databases?
VDI and VM-heavy workloads often benefit from the hypervisor-integrated datastore due to predictable policy-driven performance. Mission-critical databases can run well on either platform if configured with low-latency NVMe, dedicated cache tiers, and tuned replication; choice depends on operational model and support SLAs.
How do you plan a migration from an existing SAN to one of these platforms with minimal risk?
Start with a pilot cluster and run representative workloads. Use greenfield or brownfield migration paths depending on compatibility. Size for peak I/O, set failure-tolerance policies conservatively, and implement stepwise cutover with replication or VM-level migration tools. Maintain backups and an offsite DR target during the transition.
What are recommended best practices for backups and disaster recovery?
Combine snapshot-based backups, offsite replication, and third-party backup tools that integrate with your hypervisor or object targets. Test restores regularly. For critical data, use geographically separated replication and immutable storage options to protect against corruption and ransomware.
How important is the network switching design for software-defined storage success?
Network design is foundational. Use dedicated storage networks or segmented VLANs, ensure sufficient bandwidth (25/100GbE for high-scale), enable jumbo frames where supported, and design for low-latency east-west traffic. Redundant paths and proper QoS keep rebuilds and replication from impacting production traffic.
What staffing and support models work best for each platform?
For the hypervisor-native product, leverage existing VMware admins and consider VMware support subscriptions. For the open SDS platform, plan for Linux and storage engineering skills and evaluate commercial support providers or managed service partners to cover operations and emergency response.
Can both platforms coexist in a hybrid architecture?
Yes — many organizations pair hypervisor-integrated datastores for VM workloads with software-defined clusters for object and container storage or big data pipelines. Integration points include NFS, S3 gateways, and replication tools that bridge the systems for tiering and DR.

