Artificial intelligence is the most exciting technology revolution of recent years. Nvidia, Intel, AMD and others continue to produce faster and faster GPUs, enabling larger models and higher throughput in decision-making processes.
Outside of the immediate AI hype, one area still remains somewhat overlooked: AI needs data. First and foremost, storage systems need to provide high-performance access to ever-growing datasets, but more importantly they need to ensure that this data is securely stored, not just for the present, but also for the future.
There are multiple types of data used in typical AI systems:
Raw and pre-processed data
Training data
Models
Results
All of this data takes time and computational effort to collect, process and output, and as such needs to be protected. In some cases, like telemetry data from a self-driving car, the data might be impossible to reproduce. Even after training data is used to create a model, its value is not diminished; improvements to models require consistent training datasets so that any adjustments can be fairly benchmarked.
Raw, pre-processed, training and results datasets can contain personally identifiable information, so steps need to be taken to ensure that they are stored securely. Beyond the moral responsibility of safely storing data, there can be significant penalties associated with data breaches.
Challenges with securely storing AI data
We covered many of the risks associated with securely storing data in this blog post. The same risks apply in an AI setting as well. After all, machine learning is just another application that consumes storage resources, albeit sometimes at a much larger scale.
AI use cases are relatively new; however, the majority of modern storage systems, including open source solutions like Ceph, have mature features that can be used to mitigate these risks.
Physical theft thwarted by data at rest encryption
Any disk used in a storage system could theoretically be lost through theft, or when returned for warranty replacement after a failure event. With at-rest encryption, every byte of data stored on a disk, whether spinning media or flash, is useless without the cryptographic keys needed to decrypt it. This protects sensitive data, and proprietary models created after hours or even days of processing.
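In a Ceph deployment, disk-level encryption is typically enabled when OSDs are provisioned (for example with dm-crypt), and RADOS Gateway can additionally apply server-side encryption per object. Below is a minimal sketch using SSE-C through boto3; the endpoint, credentials, bucket and object names are hypothetical, and the gateway must be configured for SSE-C over TLS:

```python
import os
import boto3

# Hypothetical RADOS Gateway endpoint and credentials.
s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw.example.com",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

key = os.urandom(32)  # 256-bit customer-managed key; keep it somewhere safe

# The gateway encrypts the object with the supplied key, so the bytes
# on disk are useless without it.
s3.put_object(
    Bucket="training-data",
    Key="telemetry/run-001.bin",
    Body=b"sensitive telemetry bytes",
    SSECustomerAlgorithm="AES256",
    SSECustomerKey=key,
)

# Reads must present the same key again.
obj = s3.get_object(
    Bucket="training-data",
    Key="telemetry/run-001.bin",
    SSECustomerAlgorithm="AES256",
    SSECustomerKey=key,
)
```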
Strict access control to keep out uninvited guests
A key tenet of any system design is ensuring that users (real people or service accounts) have access only to the resources they need, and that this access can easily be removed at any time. Storage systems like Ceph provide their own access control mechanisms and also integrate with centralised authentication systems like LDAP, allowing straightforward access management.
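As an illustration of scoping access down to only what is needed, RADOS Gateway supports a subset of S3 bucket policies. A minimal sketch with boto3; the endpoint, credentials, user and bucket names are hypothetical:

```python
import json
import boto3

# Hypothetical admin credentials and endpoint.
s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw.example.com",
    aws_access_key_id="ADMIN_ACCESS_KEY",
    aws_secret_access_key="ADMIN_SECRET_KEY",
)

# Allow a single user read-only access to one bucket; all other users
# are denied by default.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": ["arn:aws:iam:::user/analyst"]},
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::training-data",
            "arn:aws:s3:::training-data/*",
        ],
    }],
}

s3.put_bucket_policy(Bucket="training-data", Policy=json.dumps(policy))
```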
Eavesdropping defeated by in flight encryption
There is nothing worse than someone listening in on a conversation they should not be privy to, and the same thing can happen in computer networks too. By encrypting all network flows, both client-to-storage traffic and the storage system's internal networks, no data can be leaked to third parties eavesdropping on the network.
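On the client side this can be as simple as only ever speaking TLS to the gateway, while inside the cluster Ceph's messenger v2 protocol offers a secure mode for inter-daemon traffic. A minimal boto3 sketch with a hypothetical endpoint and CA bundle path:

```python
import boto3
from botocore.config import Config

# TLS-only client: certificate verification is on, pinned to an
# internal CA bundle (path is hypothetical), so an eavesdropper
# presenting a bad certificate causes the connection to fail.
s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw.example.com",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    verify="/etc/ssl/certs/internal-ca.pem",
    config=Config(signature_version="s3v4"),
)
```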
Recover from ransomware with snapshots and versioning
It seems like every week another large enterprise has to disclose a ransomware event, where an unauthorised third party has taken control of their systems and encrypted the data. Not only does this lead to downtime, but also to the possibility of having to pay a ransom for the decryption key to regain control of those systems and access to the data. AI projects often represent a significant investment of both time and resources, so having an initiative undermined by a ransomware attack could be highly damaging.
Point-in-time snapshots or object versioning allow an organisation to revert to a previous, unencrypted state, and potentially resume operations sooner.
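For object storage, versioning can be switched on per bucket through the standard S3 API. A sketch with boto3 (endpoint, credentials, bucket and object names are hypothetical) showing how an older, known-good version might be restored:

```python
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw.example.com",  # hypothetical endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Keep every version of every object in the bucket.
s3.put_bucket_versioning(
    Bucket="training-data",
    VersioningConfiguration={"Status": "Enabled"},
)

# After an incident, list the surviving versions of an object and
# restore a known-good one by copying it over the current version.
versions = s3.list_object_versions(Bucket="training-data", Prefix="labels.csv")
good = versions["Versions"][-1]  # versions are listed newest first
s3.copy_object(
    Bucket="training-data",
    Key="labels.csv",
    CopySource={"Bucket": "training-data", "Key": "labels.csv",
                "VersionId": good["VersionId"]},
)
```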
Learn more
Ceph is one storage solution that can be used to store various AI datasets, and is not only scalable to meet performance and capacity requirements, but also has a number of features to ensure data is stored securely.
Find out more about how Ceph solves AI storage challenges:
Out of the darkness and into the light, a new path forward
Back in 2020, the CentOS Project announced that they would focus only on CentOS Stream, meaning that CentOS 7 would be the last release with commonality to Red Hat Enterprise Linux. The End of Life (EOL) of CentOS 7 on June 30, 2024, means that there will no longer be security updates, patches or new features released for the OS.
If a user deployed Ceph on this version of CentOS, their future path is challenging. There are several ways to work around the EOL, but each comes with its own nuance:
Migrate to CentOS Stream and potentially experience less stability due to its rolling nature.
Migrate to Red Hat Enterprise Linux, which could be costly due to the subscriptions required for RHEL and IBM Storage Ceph.
Migrate to another Linux distribution that promises binary compatibility with RHEL, but does not have concrete or sustainable plans to be able to do so.
Migrate to an alternative operating system like Ubuntu Linux, which has no difference between supported and self-supported versions, reducing possible lock-in to any one approach.
Risks
If the user does nothing, their ageing deployments will eventually have no supported path to upgrade to future versions of Ceph, leaving them behind in terms of new features and functionality. Even worse, they will have no access to patches for critical security bugs.
Migrating to Red Hat Enterprise Linux certainly gives a supported approach, with future updates and upgrades available, but at the cost of potentially expensive enterprise licensing for both the OS and Ceph.
Several other Linux distributions have suggested that they will be able to keep binary compatibility with upstream RHEL, but without a legal or contractual agreement this is a risky approach.
One of the most common uses of CentOS was in non-production test and dev systems, where compatibility with RHEL was assured but no licence was required. Migrating those systems to RHEL potentially means yet more cost to enrol them in support as well.
Another operating system
The fourth option stands out: it eliminates exposure to EOL, guarantees long-term support for both the OS and the application, and cuts out the need for expensive additional licensing.
We're talking, of course, about moving to an open source operating system that is licence-free. Ubuntu Linux is the go-to OS for these requirements: there are no licensing costs for the OS itself, and Ceph can be run on it without a licence as well.
Ubuntu also prevents lock-in. For production environments, support can be purchased, providing up to 24/7 telephone and ticket coverage. A single, straightforward subscription covers not only the base OS but also Ceph (and other applications running on the same nodes), with the flexibility to drop support later if it is no longer needed.
Storage migrations
Migrating data has always been a complex and time-consuming process, and depending on the scenario there are multiple ways of approaching it. A user can copy their data between two storage systems via their host servers, as sketched below.
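For example, if both the old and new systems expose S3-compatible endpoints, a host-level copy can be a short script. A minimal sketch with boto3, where both endpoints, the credentials and the bucket name are hypothetical:

```python
import boto3

# Hypothetical endpoints and credentials for the old and new clusters.
src = boto3.client("s3", endpoint_url="https://old-rgw.example.com",
                   aws_access_key_id="OLD_KEY",
                   aws_secret_access_key="OLD_SECRET")
dst = boto3.client("s3", endpoint_url="https://new-rgw.example.com",
                   aws_access_key_id="NEW_KEY",
                   aws_secret_access_key="NEW_SECRET")

bucket = "training-data"
dst.create_bucket(Bucket=bucket)  # assumes it does not exist yet

# Stream every object straight from the old cluster to the new one.
paginator = src.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        body = src.get_object(Bucket=bucket, Key=obj["Key"])["Body"]
        dst.upload_fileobj(body, bucket, obj["Key"])
```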
Alternatively, with the help of professional services, an existing Ceph cluster can be converted from one operating system to another. One approach is an in-place migration, where the data remains on the same disks and only the OS and Ceph software around them are replaced. Or, if deeper conversion work is required (e.g. FileStore to BlueStore), OSD nodes can be rotated out of the existing cluster, reconfigured and re-added, eventually replacing the entire existing installation.
These approaches can save a lot of administrative overhead; however, every migration is bespoke, so contact us to find out more.
Use open source Ceph storage to fuel your AI vision
The use of AI is a hot topic for any organisation right now. The allure of operational insights, profit, and cost reduction that could be derived from existing data makes it a technology that’s being rolled out at an incredible pace in even change-resistant organisations.
However, the AI systems that deliver these insights, savings and profits rely heavily on access to large amounts of data. Without performant and reliable storage systems, even the most cutting-edge AI solution will not be able to provide timely results. Additionally, these new AI-related workloads must not impact existing business applications; both need to operate together harmoniously.
In this blog, we will explore some of the requirements placed on a storage system by an AI solution, as well as the types of data used. We will introduce Ceph as one of the options available to store both AI-related data and typical business data.
The demands AI places on storage
New AI applications mean additional pressures and requirements for your storage systems. Here’s what your storage system needs in order to support new AI workloads:
High throughput
AI workloads need access to lots of data, and quickly: first when reading raw data, and second when writing the output following processing. It is not uncommon to see requirements in the region of hundreds of gigabytes per second, and sometimes over 1 TB/s.
Storage solutions such as Ceph allow caching elements to be added to assist with bursty write workloads, and can scale out to increase overall system throughput.
Scalability
The AI infrastructure of today is not what the AI infrastructure of tomorrow will look like. A storage system needs to be able to adapt to the needs of AI workloads, both in its ability to scale up for capacity and throughput reasons, and to scale down should the hardware need to be reused elsewhere in an organisation's infrastructure.
Flexibility
Following on from scalability, a storage system needs to be flexible enough to accommodate different types of AI workloads. Not all data is created equal; some is more important than others, and its value can change over time as well. For example, bank transaction data is more likely to be accessed during the first thirty to sixty days, when people are checking their balances and viewing end-of-month statements, than, say, in three years' time. However, it is still important that the data is preserved and available should it need to be accessed at that point.
Therefore your storage system needs to be capable of offering different tiers of storage to meet this requirement. A storage system such as Ceph allows a user to combine heterogeneous hardware, allowing them to mix and match over time as system needs dictate.
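One way to express such tiering is through S3 lifecycle rules on RADOS Gateway, which can transition objects to a different storage class after a set period. A sketch with boto3; the endpoint, credentials and bucket name are hypothetical, and the "COLD" storage class is an assumed name that would first need to be defined in the cluster's zone configuration:

```python
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw.example.com",  # hypothetical endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# After 60 days, move statement data to a cheaper storage class.
s3.put_bucket_lifecycle_configuration(
    Bucket="transactions",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-down-after-60-days",
            "Status": "Enabled",
            "Filter": {"Prefix": "statements/"},
            "Transitions": [{"Days": 60, "StorageClass": "COLD"}],
        }],
    },
)
```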
Reliability
The most important role that a storage system plays is storing data. There is little use in a storage system that is highly performant but cannot store data reliably; what good is generating and processing data if it can't be retrieved? A solution like Ceph allows a user to choose between replication-based and erasure-coding-based protection strategies – again, allowing a system configuration that matches business value with the cost to store.
Types of AI Data
Now that we understand the characteristics that a high-quality storage system needs to provide, let’s take a look at what kinds of data are most typical in AI applications. There isn’t just “one” type of AI data. There are multiple different types, all used at various stages of developing, training and deploying AI models.
Raw and pre-processed data
This is the source data extracted and retrieved from all manner of applications and systems: chat tools, email archives, CCTV recordings, support call recordings or autonomous vehicle telemetry – just to name a few examples. This data can be in all forms: database tables, text, images, audio, or video.
Once extracted from these systems, this data is typically pre-processed to ensure that it is in a useful format for training. Pre-processing can also remove otherwise redundant steps later in the pipeline, saving time and computational resources. With certain datasets, pre-processing is used to anonymise the data to ensure regulatory compliance.
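As a simple illustration of the anonymisation step, a pre-processing job might replace direct identifiers with keyed hashes before the data ever reaches a training set. A minimal Python sketch; the file and column names are hypothetical:

```python
import csv
import hashlib
import hmac

# Secret used for keyed hashing; in practice this would come from a
# secrets manager, not source code (hypothetical value).
PEPPER = b"replace-with-a-secret-key"

def pseudonymise(value: str) -> str:
    """Replace a direct identifier with a stable, irreversible token."""
    return hmac.new(PEPPER, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

# Hypothetical input/output files and column name.
with open("raw_events.csv") as src, \
     open("training_events.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row["customer_email"] = pseudonymise(row["customer_email"])
        writer.writerow(row)
```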
Training datasets
Training datasets are typically a subset of pre-processed data that is used to train an AI model. What makes this dataset special is that the expected model output has already been defined. It is important that these datasets are preserved so that they can be used to refine a model or evaluate its performance.
Models
The structure of an AI model – the layers and nodes – needs to be reliably stored so that the model can be redeployed in the future. Additionally, an AI model contains parameters and weights that are tweaked during model training. Future adjustments can be made to these variables to fine-tune the model or to deploy it in an inference role.
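In practice, checkpoints are often written to object storage alongside metadata describing the training run. A sketch using boto3 against a RADOS Gateway endpoint; the endpoint, credentials, bucket, file name and metadata values are hypothetical:

```python
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw.example.com",  # hypothetical endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# upload_file transparently switches to multipart uploads for large
# checkpoint files.
s3.upload_file(
    "model-checkpoint.pt",           # hypothetical local checkpoint
    "models",
    "resnet50/epoch-42/model-checkpoint.pt",
    ExtraArgs={"Metadata": {"training-run": "2024-06-01",
                            "val-accuracy": "0.93"}},
)
```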
Results
This is the output of the final step in the pipeline of importing, pre-processing, training and deployment. This output, or inference data, is typically the most useful and valuable data for business uses, and it must be stored so that it is available for use. In some cases, it also needs to be retained for auditing and future refinement.
Open source options for AI Storage
Finding a storage solution that delivers everything you’re looking for – cost, speed, flexibility, scalability, and support for a multitude of data sets and types – is difficult. Proprietary storage solutions can be inflexible, and public cloud services soon become costly as you grow; two areas where in-house open source solutions would be an ideal answer.
Canonical Ceph is a storage solution for all scales and all workloads, from the edge to large scale AI modelling, and for all storage protocols. Mixed workloads, with different performance, capacity, and access needs can all be accommodated by a single cluster. The scale out nature of Ceph means that hardware can be added incrementally to meet either performance or capacity needs.
Architectural overview of a Ceph cluster
Block
Block storage needs are served via the RADOS Block Device (RBD) protocol – a highly scalable, multipath-native block transport. To support legacy environments, iSCSI can also be accommodated via a gateway, and in a future release highly available NVMe-oF will be supported as well.
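For illustration, RBD images can be created and used programmatically through the python-rbd bindings. A minimal sketch, assuming a reachable cluster and a pool named "rbd"; the image name is hypothetical:

```python
import rados
import rbd

# Connect using the local Ceph configuration and keyring.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("rbd")  # pool name assumed
try:
    rbd.RBD().create(ioctx, "scratch-volume", 10 * 1024**3)  # 10 GiB image
    with rbd.Image(ioctx, "scratch-volume") as image:
        image.write(b"hello", 0)   # write five bytes at offset 0
        print(image.read(0, 5))    # b'hello'
finally:
    ioctx.close()
    cluster.shutdown()
```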
File
Shared file storage is presented either via Ceph's native POSIX-compatible protocol, CephFS, or via the NFS protocol, again through a gateway.
Object
Object storage is fully supported in a Ceph cluster, with API compatibility for both S3 and Swift.
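Because the API is S3-compatible, standard tooling works unchanged. A minimal boto3 sketch against a hypothetical RADOS Gateway endpoint, with hypothetical credentials and names:

```python
import boto3

# Point a standard S3 client at the RADOS Gateway endpoint.
s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw.example.com",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="ai-results")
s3.put_object(Bucket="ai-results",
              Key="inference/batch-001.json",
              Body=b'{"label": "cat", "confidence": 0.97}')
```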
Learn more
Join our webinar on using Ceph for AI workloads here and learn about:
How can I securely store data in a cloud storage system?
Data is the crown jewels of any organisation: if it is lost or exposed, there can be severe repercussions. Failure to protect against system failure could lead to the loss of business data, rendering a business non-functional and ultimately causing its failure. Exposing sensitive data to unauthorised parties not only leads to reputational damage, but can also cause businesses to incur massive fines.
This blog takes a closer look at these risks and how you can mitigate them with Ceph’s security features. Let’s start with some of the most common ways in which data breaches can occur:
Physical theft / transport
The loss of storage-related hardware, whether individual disks or entire storage systems, could lead to the exposure of sensitive information. This could happen in a traditional burglary, where an unauthorised party gains access to a data centre and removes hardware, or when a piece of hardware is intercepted in transit, for example while being returned to the manufacturer for repair or replacement.
Another type of physical compromise is the theft of backup tapes, which can easily be mitigated with encryption, or with tapeless backups that use in-flight and at-rest encryption.
Corruption / Bitrot
Storage systems are made up of hardware, and sometimes hardware components can completely fail. In rarer cases, components like disk drives can introduce bit-level errors which cause corruption of the data that is being stored.
Most modern systems will also store checksums for slices or chunks of data that are stored, so that any corruption is discovered when the data is read. Some, such as Ceph, will proactively scrub the stored data, so that any potential corruption is detected and repaired from either other replicas or rebuilt from erasure coded chunks.
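Conceptually, per-chunk checksumming works along the lines of the simplified Python sketch below. This is an illustration only, not Ceph's actual implementation; the chunk size and file name are arbitrary:

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB chunks (arbitrary choice)

def chunk_checksums(path: str) -> list[str]:
    """Compute a SHA-256 checksum per chunk, as a storage system
    might do at write time."""
    sums = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            sums.append(hashlib.sha256(chunk).hexdigest())
    return sums

# At read (or scrub) time, recompute and compare: any mismatch flags
# corruption in that chunk, which can then be repaired from another
# replica or rebuilt from erasure-coded chunks.
stored = chunk_checksums("dataset.bin")   # hypothetical data file
assert stored == chunk_checksums("dataset.bin"), "bitrot detected"
```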
Network eavesdropping
When data is copied between systems, either on a local network or across the internet, there is a possibility of eavesdropping, meaning the data could be intercepted by an unauthorised party during transmission. There are many components in a network path (network interface controllers, switches, routers, cables and so on), and any of them can be compromised. Detecting such a compromise is difficult or impossible, even with state-of-the-art technologies.
Insecure storage system software
A software supply chain attack could cause the software running within a storage system to be compromised, giving an adversary another path to introduce malicious code. This is not limited to the core storage software: the firmware of components such as disks, NICs and RAID controllers is also at risk. Keeping all of these software components up to date is essential.
Malicious obfuscation and encryption
Ransomware attacks have become more and more common. In this type of attack, a malicious party gains access to an organisation's IT estate and encrypts the contents of all storage devices, both local drives in servers and networked storage.
Mitigate these risks with cloud storage security features
A modern open source storage system such as Ceph provides multiple ways of protecting against the risks outlined above.
Data at rest encryption
As data is written to disk, it is encrypted by the storage system, so that if a disk is stolen, lost, or returned to the manufacturer for replacement after failure, there is no chance of a leak of the data contained on the device.
Data in flight encryption
Using encryption for all flows of data across all networks means that no sensitive data can be intercepted. The storage system can either store the data in its encrypted form, or re-encrypt it and use at-rest encryption to store it securely.
Access control
Ceph makes use of CephX and LDAP to enforce strict access control across all protocols, ensuring that only authorised users have access to the block devices, file shares or object buckets that an administrator has mapped or shared with specific users.
Snapshots and versioning
Point-in-time snapshots give a user the ability to roll back to a known good state after corruption or malicious encryption is detected, providing a recovery path from such events. Object storage also allows full object versioning, which means that when a new version of an existing object is added to the system, the older version is retained and can still be accessed if required. This feature is particularly useful in heavily regulated environments where an audit trail is required.
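For block storage, snapshots can be taken and rolled back through the python-rbd bindings. A minimal sketch, where the pool, image and snapshot names are hypothetical:

```python
import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("rbd")  # pool name assumed
try:
    with rbd.Image(ioctx, "ai-dataset") as image:
        image.create_snap("known-good")       # point-in-time snapshot
        # ... later, after corruption or malicious encryption is found:
        image.rollback_to_snap("known-good")  # revert to the snapshot
finally:
    ioctx.close()
    cluster.shutdown()
```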
Key rotation
Cryptographic keys are used to secure communication between devices, and it is of the utmost importance that these keys are periodically renewed, so that if a key is compromised, the window during which it can be used in a successful breach is relatively short.
Learn more
Ceph provides multiple mechanisms to secure data stored within the cluster, no matter the protocol used. Additionally, even when hardware components are removed from the cluster, the data remains protected thanks to strong encryption. Internet-facing APIs such as the RADOS Gateway's S3 endpoint can be configured to accept TLS connections only, and reject insecure HTTP.