Is “the cloud” secure?
Yes, cloud computing is not a new thing. And yes, the question of whether the cloud is secure has been written about many times before. Already in 2009, the European Union Agency for Cybersecurity (ENISA) published a comprehensive Cloud Computing Risk Assessment. Nevertheless, I recently dealt a lot with the Cloud Security Alliance’s Security Trust Assurance and Risk Registry, with Software-Defined Networks (SDN), Network Function Virtualization (NFV) and Deep Packet Inspection (DPI).
I found that the CSA’s Security Guidance in particular summarizes the pros and cons of security in the cloud very well. As I observe an ongoing trend of client companies moving into the cloud, I wanted to list a few of the points mentioned by CSA, Graham Thompson, Jim Doherty, and others which I found worth remembering. In future posts I want to focus in more detail on what exactly is changing for security in the cloud, including container security and DevSecOps.
In a nutshell: I think that for many companies, moving data to and processing data in the cloud is more secure than doing it on-premise. The typical company will not have the resources to assign specialized security staff to securing all the different facets of the enterprise. On the other hand, a cloud service provider (CSP), doing this for many clients at once, can easily afford it by reaping economies of scale. Not to forget that a CSP with a reputation for mediocre security will be out of business in the blink of an eye.
This being said, Gartner, in its report on the cloud landscape in September 2020, rightfully states the following:
“All the providers claim to have high security standards. However, the extent of the security controls provided to customers varies significantly.” – Gartner Magic Quadrant for Cloud Infrastructure and Platform Services, September 2020, downloaded via Amazon AWS.
Hence, let’s make the nutshell a bit bigger…
What does “moving to the cloud” mean and what are the (non-security) benefits?
A common scenario is that a client company signs a contract with one of the big cloud providers that I compared in a previous post. This is then part of their strategy of “moving to the cloud”, which more often than not means moving computational power, storage and networking capabilities from an on-premise data center to a third-party’s off-premise data center, a.k.a. “the cloud”. Often-cited advantages of doing so are:
- Agility: Moving faster in the cloud, without the need for lengthy hardware purchases and deployments.
- Resiliency: Having less downtime through virtualization and failover capabilities of the software-defined network (SDN).
- Economy: Saving money with monthly payments for what was actually used (OPEX) instead of paying for overprovisioned and aging on-premise hardware (CAPEX).
What is a cloud, anyway?
The US National Institute of Standards and Technology defined five essential characteristics, three cloud service models and four deployment models in its widely accepted Special Publication 800-145 from 2011. I will not go into detail on these, but quickly list them here.
Essential characteristics of the cloud:
- On-demand self-service
- Broad network access
- Resource pooling
- Rapid elasticity
- Measured service
Cloud service models:
- Software as a Service (SaaS)
- Platform as a Service (PaaS)
- Infrastructure as a Service (IaaS)
Some institutions also list models beyond these three “traditional” ones. The Cloud Security Alliance (CSA) also includes Security as a Service (SECaaS, basically SaaS for security software), and Gartner even talks about “functions as a service (FaaS), database PaaS (dbPaaS), application developer PaaS (adPaaS)” – you can download the Gartner report via the AWS website.
Cloud deployment models:
- Public cloud
- Private cloud
- Hybrid cloud
- Community cloud
Apart from the NIST definition, there are others as well, for example from ISO. The latter claims multi-tenancy as an additional characteristic, while NIST considers this to be implied in the “resource pooling” characteristic.
What should be understood when seeing resource pooling and rapid elasticity is that cloud providers try to minimize specialized hardware and appliances in their data centers and use common off-the-shelf hardware instead. On the one hand this reduces the risk of vendor lock-in. On the other hand, it increases their ability for resource pooling and elasticity: By running almost everything that previously came in a specialized appliance as a virtualized instance, CSPs can move functionality to where it is needed, when it is needed. They are not working with one server of x CPUs, y GB of storage, z GB of memory and some network adapters (NICs). Instead, they are working with generic pools:
- CPU pool
- Memory pool
- Storage pool
- Interconnect Pool
They then use a Distributed Resource Scheduler (DRS) to assemble these abstracted resources according to their clients’ needs. What’s important to understand is that this automated “assembly” – called orchestration – is a key technique for the cloud: Abstraction without (automated) orchestration is just virtualization. As the CSA explains:
“This is the difference between cloud computing and traditional virtualization; virtualization abstracts resources, but it typically lacks the orchestration to pool them together and deliver them to customers on demand, instead relying on manual processes.” – CSA Cloud Security Guidance v4, page 9
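The pooling-plus-orchestration idea can be sketched in a few lines of Python. This is a purely illustrative toy (no real DRS works like this, and the pool sizes are made up): a scheduler carves VMs out of generic resource pools on demand and refuses requests the pools cannot satisfy.

```python
from dataclasses import dataclass

@dataclass
class Pools:
    """Generic, abstracted resource pools (illustrative sizes)."""
    cpu: int         # available vCPUs
    memory_gb: int   # available memory
    storage_gb: int  # available storage

def provision_vm(pools: Pools, cpu: int, memory_gb: int, storage_gb: int) -> bool:
    """Orchestration step: carve a VM out of the pools, if capacity allows."""
    if pools.cpu >= cpu and pools.memory_gb >= memory_gb and pools.storage_gb >= storage_gb:
        pools.cpu -= cpu
        pools.memory_gb -= memory_gb
        pools.storage_gb -= storage_gb
        return True
    return False  # elasticity limit reached, request denied

pools = Pools(cpu=64, memory_gb=256, storage_gb=10_000)
assert provision_vm(pools, cpu=8, memory_gb=32, storage_gb=500)
```

The point is that the scheduler never reasons about individual servers, only about aggregate capacity – which is exactly what makes automated, on-demand delivery possible.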
Security Challenges in the Cloud
As mentioned before, I want to go more into detail in following posts regarding changes to on-prem security paradigms when moving to the cloud and security aspects of technologies commonly found in the cloud like containers. But generally speaking, we can identify several challenges in the cloud.
Security of the management plane
The cloud introduces an additional dimension – the so-called “metastructure” – where all the cloud resources are managed via a “management plane”. Obviously, this management plane needs to be extremely well secured – otherwise a whole company’s assets could be compromised at once, or even deleted with the click of a button. A primary admin account for the management plane is usually locked away, with the password stored in a physical safe and with best practice security measures like multi-factor authentication (MFA) naturally enabled.
Break-glass process for incident response
The incident response process always needs to start with ensuring that the attacker is not in the management plane (anymore) – otherwise, the attacker might notice your containment, eradication and recovery efforts and, sitting god-like in the management plane, revert them all or worse, “delete your company”. To ensure the attacker is out, a proper break-glass procedure to retrieve the credentials and activate the primary cloud account needs to be defined and well-tested.
Security of the virtualization layer
Since cloud computing makes heavy use of virtualization, this additional virtualization layer also needs to be properly secured (e.g. the hypervisor, being basically “an OS for an OS”). While the CSP is mainly responsible for keeping the virtualization infrastructure secure, the consumer also needs to be aware that security of virtualized assets might require different provisions.
Clarity of shared responsibility
The cloud is a shared responsibility model in which the CSP and the cloud consumer are both responsible for ensuring security. The responsibility shifts from “mostly the CSP’s responsibility” in SaaS to “mostly the consumer’s responsibility” in IaaS. The biggest challenge here is to know and define exactly who is responsible for which aspect, what is guaranteed in the contract and service level agreements (SLAs), and what is implied by documentation and technology specifics. Having a responsibility matrix is a good idea.
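Such a responsibility matrix can start as something as simple as a lookup table. The layer names and the exact split below are illustrative assumptions, not taken from any specific CSP’s contract or SLA:

```python
# Hypothetical, simplified responsibility matrix. Each layer maps to
# who is responsible under SaaS, PaaS, and IaaS respectively.
RESPONSIBILITY = {
    #               SaaS        PaaS        IaaS
    "physical":    ("csp",      "csp",      "csp"),
    "hypervisor":  ("csp",      "csp",      "csp"),
    "os_patching": ("csp",      "csp",      "customer"),
    "application": ("csp",      "customer", "customer"),
    "data":        ("customer", "customer", "customer"),
}
MODELS = ("saas", "paas", "iaas")

def responsible(layer: str, model: str) -> str:
    """Look up who is responsible for a given layer under a service model."""
    return RESPONSIBILITY[layer][MODELS.index(model)]
```

Note how responsibility for the data itself never leaves the customer, no matter the service model.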
Need for due diligence and addressing of governance gaps
Organizations moving to the cloud can never outsource their responsibility for proper governance (policies, processes, internal controls), including the underlying information security. If something happens, the cloud consuming party will be held responsible by its clients, so it would be well advised to previously conduct proper due diligence and to address potential governance gaps.
Reliance on external audits
Since moving to the cloud can decrease visibility into what is actually happening with the data and how well it is protected, clients need to rely on external auditors who assess the cloud provider’s compliance. Clients should use these audits – ensuring that the audit scope is relevant – as a tool for proving or disproving the CSP’s awareness of and adherence to corporate obligations (i.e. compliance).
Changes in digital forensics and need for increased logging
The ability to collect data for forensic analysis might be limited, especially in SaaS with provider-specific, proprietary software. Bit-by-bit copies of data will be nearly impossible, so all involved parties would be wise to implement and enable as much custom logging as possible, capturing and storing all available information, including metadata.
Failure in isolation of tenant traffic
If the cloud provider fails to effectively isolate traffic of different tenants, the loss in confidentiality and reputation can be severe – for all involved parties. Even if the customer connects with an encrypted protocol (e.g. HTTPS) to the cloud resources, this encrypted tunnel is often terminated (i.e. unencrypted) at the CSP’s gateway. This means that data might be moving inside the cloud without encryption. Even if it is encrypted in transit, it will most likely be unencrypted in memory, while being processed. This is especially true for SaaS.
Encryption of remotely stored data
Encrypting data at rest is a very good idea, because if the CSP is hacked, the data might at least be unusable. However, if the keys for managing encryption are with the CSP, from a risk perspective it is similar to not having encryption at all – even if per-customer keys are used. On the other hand, if the customer itself manages the keys, it needs to be ensured that the keys for decryption are installed on every device – including potential BYOD systems. This is very difficult to achieve, decreasing the benefit of having omnipresent and device-agnostic data accessibility.
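The key-management argument can be made concrete with a toy envelope-encryption sketch. A one-time-pad-style XOR stands in for a real cipher here (do not use this for actual encryption); the point is only that whoever holds the key-encryption key (KEK) can always unwrap the data key and decrypt, no matter where the ciphertext is stored:

```python
import secrets

def xor(data: bytes, key: bytes) -> bytes:
    # Toy stand-in for a real cipher: one-time-pad-style XOR.
    return bytes(a ^ b for a, b in zip(data, key))

# Envelope encryption sketch: a per-object data key encrypts the data;
# a key-encryption key (KEK) "wraps" the data key.
plaintext = b"patient record"
data_key = secrets.token_bytes(len(plaintext))
ciphertext = xor(plaintext, data_key)

kek = secrets.token_bytes(len(data_key))   # held by customer OR CSP
wrapped_key = xor(data_key, kek)           # stored alongside the ciphertext

# Whoever holds the KEK can do exactly this -- including the CSP,
# if key management was left with the provider:
recovered = xor(ciphertext, xor(wrapped_key, kek))
assert recovered == plaintext
```

This is why customer-managed keys change the risk picture: without the KEK, the wrapped data key stored next to the ciphertext is useless to the provider or an attacker.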
Need for an adjusted Software Development Lifecycle (SDLC)
There are many frameworks that address the secure development of software, including Microsoft’s Security Development Lifecycle (SDL) and NIST SP 800-160. The CSA summarizes different frameworks in the following three “meta-phases” (see page 111 in the CSA guidance):
- Secure Design and Development (including employee training)
- Secure Deployment
- Secure Operations
Long story short: Developing software for the cloud and securely running it there requires some fundamental changes in the whole design and development lifecycle, and employees need to gain general as well as provider-specific cloud security knowledge to perform well in this new area. If applications are not written for the cloud and hence do not make use of its scalability via auto-scaling groups etc., they might well run less stably: A single VM in the cloud can be less stable than in many dedicated on-prem environments, so the application needs to be cloud-aware and reap the benefits of virtualization to ensure stability and elasticity. It should also make use of provider-specific security capabilities, e.g. serverless load balancing to avoid certain denial-of-service attacks altogether.
Controlling what goes in the cloud, to where, and for whom
A “lift and shift” approach of moving everything into the cloud might seem straightforward and easy at first. But it certainly is not. Some data, including regulated information such as patient health data, should maybe never end up in the cloud. But legal requirements aside, a risk evaluation should take place for all data that is a candidate to be moved to the cloud. If a breach of the company’s intellectual crown jewels would mean the end of its business, this data will obviously need to be secured much more tightly than other data.
Of course, depending on the company’s own security posture, the data might still be more secure in the cloud – but maybe not, and this should be properly evaluated. After this, strict access management – both for people and for services – is a must, as is monitoring for intrusions and anomalies. Attribute-based access control (ABAC) should be preferred over role-based access control (RBAC) for enhanced flexibility and security.
An entitlement matrix should be set up, detailing which users, groups and roles have access to which resources and functions. This should be regularly reviewed and an alarm should trigger whenever a resource changes to public visibility (it would not be the first S3 bucket with confidential data that can be accessed by anyone from the internet…).
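A periodic visibility review of this kind can be automated with very little code. The bucket inventory below is hypothetical; in practice it would come from the provider’s inventory API:

```python
# Hypothetical bucket inventory (names are made up). A periodic review
# flags any storage resource whose visibility is "public" so that an
# alarm can be triggered.
buckets = [
    {"name": "internal-reports", "visibility": "private"},
    {"name": "marketing-assets", "visibility": "public"},
    {"name": "customer-exports", "visibility": "public"},
]

def public_buckets(inventory):
    """Return the names of resources that should trigger an alarm."""
    return [b["name"] for b in inventory if b["visibility"] == "public"]

alerts = public_buckets(buckets)
```

A real implementation would of course distinguish intentionally public resources (like marketing assets) from accidental exposures, e.g. via an “approved public” tag.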
More aspects to consider in Business Continuity Planning (BCP)
When a region of a company’s cloud provider goes down, the company may have planned a failover to another region. Hopefully, the jurisdiction of that other region also complies with all the legal requirements the customer’s data is subject to. Apart from that, while failover might have gone smoothly during all the tests for a company’s individual account, it might be a different story when a whole region fails over to another one – which suddenly has to handle considerably more load.
In such cases, business continuity planning might suggest a failover to a completely different provider. But then again, the customer’s virtual assets need to be compatible with that provider, too. And employees will need to have expertise with that second provider, as well. Considering this, it should not be forgotten that downtime is always an option, given that the CSP’s recovery time objective (RTO) is within limits that are acceptable for the customer.
Another factor to consider is that with the cloud-specific metastructure that holds the management plane, yet another area of the logical stack needs to be considered. While technologies like software-defined infrastructure allow to quickly and exactly recreate a whole company based on templates, regular backups and restoration tests need to be added to business continuity and disaster recovery planning.
The cloud as aggregation of risk
Attackers look for targets where the potential bounty outweighs the necessary effort. A small company might have a less than optimal security posture, but if there is not much to steal, attackers might move on. At the same time, many clients store confidential data in one common place – the cloud. This is a massive concentration of risk, because it attracts a multitude of skilled attackers. If the cloud provider is doing things right, the security posture of each individual client might still be much higher than if they were hosting by themselves, but zero-day exploits and occasional isolation-circumventing vulnerabilities like Spectre and Meltdown are difficult to prevent before they are publicly known and patchable.
Security monitoring as additional cost factor
As the cloud customer hands over responsibility to the provider, many possibilities for customized security monitoring go away. Services such as a cloud-based SIEM are subject to the provider’s pricing scheme, and if the provider’s proprietary applications do not create proper logs, there is not much to monitor anyway. With multi-tenancy and resource pooling come economies of scale for the provider, but also less flexibility in contracts for the consumer.
Wanting to have customized logging or monitoring will probably be impossible or at least prohibitively costly. Finally, even if the information is there and can be sent to a monitoring solution sitting somewhere on the customer premises, cloud providers can decide to charge for egress – i.e. getting data out of the cloud. This can make it very costly to copy all audit and security logs to a local Security Information and Event Management (SIEM).
Speaking of SIEM: Distinguishing machines based on their IP or hostname, as was traditionally done, will be of little value in a cloud environment. The syslog header of a logging container or application will still contain an IP or hostname, but that instance might be running for just a few minutes and then be gone, with the IP reused by a potentially completely different program. Understanding such logs during forensic investigations can be close to impossible: Who owned that IP at a given point in time, which host was it talking to – such questions cannot be answered anymore without additional information. This also requires additional training for SIEM engineers and analysts.
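One way to keep such logs usable is to record IP “leases” with their validity intervals and resolve ownership by timestamp. A minimal sketch, with made-up instance names:

```python
# Enriching SIEM events with instance ownership: cloud inventory records
# (illustrative) state which instance held an IP during which interval.
# A plain "who is 10.0.0.5?" lookup is meaningless without a timestamp.
leases = [
    # (ip, instance_id, start, end) -- times as epoch seconds
    ("10.0.0.5", "i-frontend-abc", 1000, 1060),
    ("10.0.0.5", "i-batchjob-xyz", 1100, 1400),
]

def owner_at(ip, ts):
    """Resolve which instance held an IP at a given point in time."""
    for lease_ip, instance, start, end in leases:
        if lease_ip == ip and start <= ts < end:
            return instance
    return None  # no inventory record: the question cannot be answered
```

The same IP resolves to different workloads depending on the timestamp – exactly the lookup a SIEM needs to perform at enrichment time.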
Why the cloud can be more secure
So wait, in my “nutshell” at the beginning I wrote that I consider the cloud even more secure than many on-premise deployments, but now I’m listing this many challenges – how does that go together? Let me elaborate on my “nutshell” from above.
Economies of scale and incentive for security
Ok, so this first point is basically a repetition of my non-technical nutshell from above: A single company might not find the talent to secure its perimeter, internal infrastructure, and all applications and accounts. It might also not have the money and the time to do so. A cloud provider, on the other hand, can reap economies of scale: implement a high security baseline once (and improve it continuously), then use it for all customers. It also has a strong incentive: assuring customers that their data is secure in the cloud is a prerequisite to gaining and retaining them.
Strict traffic isolation with Software-defined networks (SDN)
I do not want to lose myself in technical details, so to make it crisp and clear: For proper defense in depth, it is necessary to keep data from different departments and/or clients strictly isolated. An individual company without a cloud might be working with VLANs, plus the possibility to sniff network traffic for troubleshooting.
The problem with this is that a VLAN does not really isolate the traffic at all. It is nothing more than a tag in the Ethernet frame header which tells the switch to separate traffic into logical units. Moreover, the possibility to sniff network traffic is convenient for troubleshooting purposes, but as soon as attackers gain access to one machine in the network, they can just as easily start monitoring what’s going over the wire.
Software-defined networks (SDN), on the other hand, can provide real traffic isolation. The layer 2 frames are not “VLAN-tagged” anymore (although SDN with VLAN is possible); instead, they are wrapped into a layer 4 UDP packet. This technique, known as VXLAN, hence provides a more complete abstraction on top of the networking hardware (and enables a much bigger logical address space, important for scalability).
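For the curious, the VXLAN header itself (RFC 7348) is only 8 bytes: a flag byte and a 24-bit VXLAN Network Identifier (VNI). The sketch below builds it and prepends it to an inner layer 2 frame; the outer Ethernet/IP/UDP headers are omitted:

```python
import struct

VXLAN_UDP_PORT = 4789  # IANA-assigned destination port for VXLAN

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header: flags, 24-bit VNI, reserved bytes."""
    assert 0 <= vni < 2**24
    flags = 0x08  # "I" flag: the VNI field is valid
    # !B3xI = flags byte, 3 reserved bytes, then VNI in the top 3 of 4 bytes
    return struct.pack("!B3xI", flags, vni << 8)

def encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """The original layer 2 frame simply becomes the payload."""
    return vxlan_header(vni) + inner_frame

packet = encapsulate(b"\xaa" * 64, vni=5001)
assert len(packet) == 8 + 64
```

The 24-bit VNI is where the scalability comes from: about 16.7 million logical segments instead of the 4096 possible VLAN IDs.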
SDN as used in cloud environments comes with more benefits though: It decouples a router’s or switch’s control plane from its (packet-forwarding) data plane and moves that control plane into a centralized controller. That SDN controller can then automatically monitor traffic, find the best route for each packet, ensure Quality of Service (QoS) for each tunnel and customer, and much more. If the controller changes parameters for one customer or tunnel to improve the connection, it does so completely virtually – the underlying hardware remains unaffected, which means that all other traffic is unaffected as well, making the whole network more stable.
By the way: The fact that QoS can be ensured per customer within one network (the “customers” can also be, for example, different departments of the same company) is not an obvious feature! By wrapping layer 2 frames into a UDP packet, software that only looks at the header will only be able to ensure QoS for the whole tunnel, because the important details are wrapped and “hidden” inside the payload. An SDN controller, however, will be able to ensure QoS specifically per client.
Network sniffing between tenants and even within a tenant network is usually disabled for the reason explained above: to prevent an attacker from using one compromised host to eavesdrop on the target’s communication. This can make troubleshooting a bit more difficult and usually requires enhanced logging in deployed applications to be able to investigate when something does not work. However, more logging is an added benefit when it comes to security monitoring anyway, so this setup is a real improvement from a security perspective.
SDNs hence centralize all control and management functions on layers 2 and 3 in one dedicated software controller, bringing all the otherwise distributed “brains” of each network appliance, router, switch etc. into one centralized place. While the hype around SDN has faded in recent years, it is a topic that is here to stay and will further gain adoption in the data center. For more information on the topic I recommend Jim Doherty’s book “SDN and NFV Simplified”. For specific implementations of SDN, have a look at the OpenDaylight Project, Cisco’s Application Centric Infrastructure (ACI), or Juniper Contrail.
Flexibility and granularity for security functions with Network Function Virtualization (NFV)
While an SDN’s intent is to bring a network’s controlling logic “from distributed to centralized” on layers 2/3, it’s the other way round for Network Function Virtualization (NFV): The idea is to bring network functionality on layers 4 to 7 from centralized “chokepoints” to a distributed layout. What NFV does is abstract network functions from their underlying hardware and create software instances for each of these functions. By “network functions” I mean all the solutions usually found in dedicated network appliances, such as web application firewalls (WAFs), load balancers (LBs), firewalls (FWs), intrusion detection and prevention systems (IDS/IPS), URL filters – you name it.
What advantage does this have?
In a “traditional” network you have dedicated security appliances at strategic locations through which all traffic needs to pass in order to be inspected by the WAF, URL filter, etc. This means that you need to overprovision this hardware (to cope with peak load) and are nevertheless creating bottlenecks. What can also happen is that you unnecessarily route malicious traffic over large parts of your network before it reaches a security appliance where it is finally dropped.
By virtualizing such functionality – i.e. moving it out of dedicated hardware and running it on off-the-shelf hardware – you can bring it exactly where it is needed, when it is needed. You don’t create bottlenecks, since you can run security checks right at the network edge where traffic hits your network. Although hardware appliances will of course be much more efficient at handling lots of traffic, your virtualized functionality will usually not have to deal with such large amounts of data, since you can simply use many more WAFs, FWs, IDSs etc. and funnel traffic at much finer granularity, processing only what occurs at that specific location. This also avoids single points of failure, because if one virtualized appliance fails, it is automatically and swiftly recreated in another virtual instance.
Together with the aforementioned ability of SDN to isolate traffic, NFV enables even better network isolation. You can deploy as many virtual firewalls (security groups) as you want, set them to “deny by default” to only allow explicitly whitelisted traffic, and protect each individual application. Even if one application is compromised by an attacker, the blast radius is extremely limited, because the whole application is surrounded by a virtual firewall that does not allow malicious traffic to go out.
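A deny-by-default security group is conceptually just a whitelist lookup. The sketch below illustrates the policy model, not any provider’s actual implementation (names are invented):

```python
# Toy model of a deny-by-default virtual firewall (security group):
# only explicitly whitelisted (source, port) pairs are allowed,
# everything else is dropped -- limiting the blast radius of a
# compromised application.
class SecurityGroup:
    def __init__(self):
        self.allowed = set()  # pairs of (source group/CIDR, port)

    def allow(self, source: str, port: int) -> None:
        self.allowed.add((source, port))

    def permits(self, source: str, port: int) -> bool:
        return (source, port) in self.allowed  # deny by default

app_sg = SecurityGroup()
app_sg.allow("lb-group", 8443)  # only the load balancer may reach the app

assert app_sg.permits("lb-group", 8443)
assert not app_sg.permits("compromised-host", 8443)
```

Because nothing is reachable unless explicitly allowed, lateral movement from a compromised host hits a wall by default.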
Rapid incident response, leveraging SDN and NFV
For incident response, the rapid and flexible deployment of network functionality also brings advantages: Instead of shutting down a compromised host completely during containment, you can simply move it into an isolated security group that only the investigator can enter, with all outbound traffic blocked.
In parallel, and thanks to network virtualization, another virtual instance of the affected host can spin up, take over the functionality of the compromised host and avoid any downtime. The compromised machine state can be saved, which is important as evidence for prosecution and also for digital forensics: You can “freeze” the crime scene and look at it as long as you want in order to understand the attack vectors and the kill chain. Such observations will make important contributions to your security team’s post-mortem lessons-learnt session and help with continuous improvement.
The secret behind the ability to isolate compromised hosts without any downtime for service delivery therefore lies, again, in SDN and NFV. An application does not run on a monolithic server that exclusively provides one functionality; instead, it runs in an “auto scaling group”. The AWS explanation reads as follows:
“An Auto Scaling group starts by launching enough instances to meet its desired capacity. It maintains this number of instances by performing periodic health checks on the instances in the group. The Auto Scaling group continues to maintain a fixed number of instances even if an instance becomes unhealthy. If an instance becomes unhealthy, the group terminates the unhealthy instance and launches another instance to replace it.” – AWS User Guide, Auto Scaling groups
What this means is that there is no dedicated server anymore, but only a logical group that can use any kind of available infrastructure component (computing, storage, networking). By decoupling software from the underlying hardware layout and making that layout completely transparent to anything running on it, “magic” like seamless live migration of running VMs to a totally different system becomes possible (see VMware’s vSphere vMotion for details).
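The quoted behavior can be modeled in a few lines. This toy only mimics the maintain-desired-capacity loop; real auto scaling groups add health-check grace periods, scaling policies, and much more:

```python
import itertools

# Minimal model of the auto scaling behavior quoted above: keep the
# group at its desired capacity by replacing unhealthy instances.
_ids = itertools.count(1)

def launch():
    """Launch a fresh (healthy) instance with a new identifier."""
    return {"id": f"i-{next(_ids)}", "healthy": True}

class AutoScalingGroup:
    def __init__(self, desired: int):
        self.desired = desired
        self.instances = [launch() for _ in range(desired)]

    def health_check(self):
        # Periodic check: terminate unhealthy instances, launch replacements.
        self.instances = [i for i in self.instances if i["healthy"]]
        while len(self.instances) < self.desired:
            self.instances.append(launch())

asg = AutoScalingGroup(desired=3)
asg.instances[0]["healthy"] = False  # simulate a failing VM
asg.health_check()
assert len(asg.instances) == 3       # capacity restored automatically
```

Note that the replacement is a brand-new instance with a new identifier – which is exactly why per-host thinking (and per-IP log analysis, as discussed above) breaks down in the cloud.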
A few comments on the technical background: When moving a virtual instance, the IP address of course has to remain the same so that the instance remains accessible. This is not trivial, because the VM might be moved to a totally different server somewhere else in the data center. Theoretically this could be solved using a lot of VLANs, but that does not scale well (limited to 4096 VLANs by design, versus 16.7 million with VXLAN) and would be a nightmare to manage. It would also require a rather flat network where all systems are in the same broadcast domain so that they can directly address each other. This in turn would mean that all systems would have to remember the MAC addresses of all other systems in that domain which they identified with their ARP requests, leading to very large lookup tables – apart from the potential broadcast storm that so many ARP requests could create due to the lack of any TTL. SDN (using VXLAN) solves this in a very elegant and scalable way. If you are familiar with another overlay network technique – MPLS – you can imagine SDN as basically “MPLS plus orchestration”:
“Looking at SDN and MPLS as competing technologies is fundamentally wrong. MPLS is a key SDN enabler. This statement holds particularly true if you look at MPLS as an architectural paradigm (not as an encapsulation).” – Antonio Monge and Krzysztof Szarkowicz: MPLS in the SDN Era
Quick disaster recovery with Software-defined Infrastructure (SDI)
With SDN and NFV, major parts of your infrastructure will be virtualized. This comes with the added advantage that, in case of a security incident or any other major service disruption, you can rebuild a known-good configuration extremely quickly through software orchestration – a concept known as Software-defined Infrastructure (SDI). Just make sure that regular backups are stored on some kind of offline storage, so that an attacker with access to the management plane cannot compromise the backups with the known-good configuration.
Safer testing and faster intrusion detection with Infrastructure as Code (IaC)
The Software-defined Infrastructure explained above can be understood as one component in the general concept of Infrastructure as Code (IaC). The idea is to define templates of hardware-agnostic infrastructure configurations which can be deployed quickly and as often as needed, without the risk of human error (forgetting some manual configuration step etc.). Amazon and Microsoft list several more advantages in their respective IaC offerings, AWS CloudFormation and Templates for Azure Resource Manager.
From a security perspective, immutability and version control are very interesting: Whenever you want to implement changes in the infrastructure, you should do this in the infrastructure template, not in the running infrastructure, in order to keep that change for future rollouts. Such template adjustments can be tracked and reverted through a version control system like Git. Moreover, templates can be used to create a test environment which is an exact copy of the production environment, making it possible to test software under conditions which come as close to production reality as it can get. This of course is a huge benefit when it comes to penetration testing, where results are only meaningful if the testing ground resembles reality.
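Version-controlled templates also enable simple drift detection: diff the declared template against the observed state and alert on any divergence. The keys and values below are invented for illustration:

```python
# "Changes go into the template, not the running system": any divergence
# between the declared template and observed reality is drift worth
# investigating -- possibly a manual hotfix, possibly an intrusion.
template = {
    "web_sg_ingress": ["443"],
    "db_encryption": "enabled",
    "admin_mfa": "required",
}
running_state = {
    "web_sg_ingress": ["443", "22"],   # someone opened SSH by hand
    "db_encryption": "enabled",
    "admin_mfa": "required",
}

def drift(declared: dict, observed: dict) -> dict:
    """Return keys where reality diverged from the template, with both values."""
    return {k: (declared[k], observed[k])
            for k in declared if declared[k] != observed[k]}

changes = drift(template, running_state)
```

Here the review would surface that port 22 was opened outside the template – exactly the kind of change that should either be rolled back or promoted into version control.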
With all changes implemented in the template instead of running infrastructure, it also becomes easier to lock down the (virtual) infrastructure components much more tightly than it would be possible in a non-cloud environment. In such less volatile environments it is also useful to have integrity monitoring software running. Such software alerts on any changes made to files and configurations. In a “traditional” setup where many things are constantly changing, integrity monitoring can overwhelm security analysts with alerts and it becomes cumbersome to spot the really dangerous changes among the many benign false positives.
Reduced attack surface with immutable workloads
This point can be understood as part of the whole category of “immutable infrastructure as code”: If you define the whole infrastructure in your templates, this will naturally also include the images for your virtual machines. I nevertheless list immutable workloads here separately and want to go into a bit more detail.
If you set up and configure an instance beforehand using a known-good template, you can restrict available features considerably. For example, you will not need to have remote logins to that running instance enabled, if you know that the configuration is already done at startup time.
For communication with other network components, virtual machines should be restricted to using well-defined APIs. This way, the scope of available actions for attackers will be very much limited, making reconnaissance (such as port scanning and fingerprinting), lateral movement and privilege escalation somewhat harder. Available API calls should additionally be monitored for unusual activity to spot an attacker’s attempts as early as possible.
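Monitoring API calls against the expected set can likewise be sketched as a whitelist check. The call names below are made up:

```python
from collections import Counter

# An immutable workload is only expected to make a small, fixed set of
# API calls; anything outside that set is flagged for review.
ALLOWED_CALLS = {"GetObject", "PutMetric", "SendHeartbeat"}

def unusual_calls(observed_calls):
    """Count observed API calls that fall outside the expected set."""
    return Counter(c for c in observed_calls if c not in ALLOWED_CALLS)

log = ["GetObject", "SendHeartbeat", "ListUsers", "GetObject", "ListUsers"]
suspicious = unusual_calls(log)
assert suspicious  # "ListUsers" was never expected -> investigate
```

A sudden burst of unexpected calls (here, identity enumeration via a hypothetical “ListUsers”) is a classic early sign of reconnaissance.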
For troubleshooting in such a locked-down host, it might make sense to increase logging capabilities and store the logs outside of the running instance. Also, extensive testing should be done prior to the deployment into production, because afterwards – without the ability to log in to the running instance – tracing errors can be more difficult. This is not necessarily a disadvantage: as mentioned earlier, more logs only benefit your security monitoring as a whole, as long as they are meaningful and provide a clear picture of what was going on in the virtual machine. Additionally, extensive testing is of course always a recommendable practice to spot configuration flaws and other potential security holes.
With very little changing in a running instance, you also do not have to worry about newly installed services or unexpectedly modified files, making it much easier to implement security features such as service whitelisting and file integrity monitoring.
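File integrity monitoring itself boils down to comparing cryptographic hashes against a baseline. A minimal sketch (file contents are passed in as bytes here instead of being read from disk):

```python
import hashlib

# On an immutable workload the baseline barely changes, so any hash
# mismatch is a meaningful alert rather than noise.
def fingerprint(files: dict) -> dict:
    """Map path -> SHA-256 hex digest of the file's contents."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}

baseline = fingerprint({"/etc/app.conf": b"port=8443\n"})
current  = fingerprint({"/etc/app.conf": b"port=8443\nbackdoor=1\n"})

modified = [p for p in baseline if baseline[p] != current.get(p)]
assert modified == ["/etc/app.conf"]  # tampering detected
```

In a volatile environment the same check would fire constantly; with immutable workloads, every hit deserves an analyst’s attention.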
Zero Trust with Software-defined Perimeter (SDP)
The Cloud Security Alliance (CSA) developed the so-called Software-defined Perimeter framework. This model takes the abovementioned granular microsegmentation via SDN security groups to the extreme by basically starting with a “zero routing” network for unauthenticated users with no systems visible at all. The framework then establishes a temporary network for the individual user context, using three components:
- SDP client installed on a user’s device (laptop, smartphone, etc.)
- SDP controller that authenticates, authorizes and audits the user
- SDP gateway that enforces routing policies as established by the controller
When a user authenticates, the SDP controller decides which systems can be accessed based on the device the agent is running on, as well as the user’s identity. Access is granted on a need-to-know basis and servers are only provisioned as required in that specific moment in that specific context. The SDP controller instructs the dynamically spawned servers to only accept connections from the other servers known to be needed for that specific session. The list of provisioned systems is then provided to the connecting user’s SDP client. The SDP gateway enforces routing rules that allow access only from the specific user device, for the specific user, to the specific hosts.
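The controller/gateway interplay described above can be sketched as a toy access-decision model. Everything here is illustrative: the policy table, roles and hostnames are invented, and a real SDP controller would additionally provision the servers and push connection rules to them.

```python
from dataclasses import dataclass

# Hypothetical need-to-know policy: which systems each role may even see.
POLICY = {
    "developer": {"git.internal", "ci.internal"},
    "finance": {"erp.internal"},
}

@dataclass(frozen=True)
class Session:
    user: str
    device_id: str
    allowed_hosts: frozenset

def controller_authorize(user: str, role: str, device_id: str,
                         device_compliant: bool):
    """Controller: decide based on identity *and* device posture.
    A non-compliant device gets no session, i.e. sees nothing at all."""
    if not device_compliant:
        return None
    return Session(user, device_id, frozenset(POLICY.get(role, ())))

def gateway_allows(session: Session, device_id: str, host: str) -> bool:
    """Gateway: enforce the controller's policy for the specific
    device, the specific user session, and the specific host."""
    return device_id == session.device_id and host in session.allowed_hosts
```

Note that the gateway never consults the policy table itself; it only enforces what the controller already decided, mirroring the separation of the two components.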
Above I wrote that an unauthenticated user will not see any systems at all – this includes even the SDP controller and gateway! They implement a special form of port knocking, known as Single Packet Authorization (SPA). A single packet that can only be properly crafted and encrypted by a valid SDP client is sent to the SDP controller. That controller is not even actively listening – making port scans impossible – but a second service notices the “knock”; if the packet is correct, the controller starts listening for traffic from the SDP client for a defined timespan (for example, 30 seconds).
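The essence of such a “knock” can be illustrated with a short sketch: the client authenticates a single payload with an HMAC over its identity, a timestamp and a nonce, and the server silently drops anything that fails validation. This is only a simplified model under assumed names; real SPA implementations such as fwknop additionally encrypt the payload and track nonces to prevent replay.

```python
import hashlib
import hmac
import os
import time

# Hypothetical shared secret provisioned to the SDP client.
SECRET = b"client-shared-secret"

def craft_spa_packet(client_id: str, secret: bytes = SECRET) -> bytes:
    """Client side: build a single payload of id|timestamp|nonce,
    authenticated with an HMAC-SHA256 tag."""
    body = f"{client_id}|{int(time.time())}|{os.urandom(8).hex()}".encode()
    tag = hmac.new(secret, body, hashlib.sha256).hexdigest().encode()
    return body + b"|" + tag

def validate_spa_packet(packet: bytes, secret: bytes = SECRET,
                        max_age: int = 30) -> bool:
    """Server side: silently reject anything malformed, stale or
    unauthenticated; only a valid knock opens the listener."""
    try:
        body, tag = packet.rsplit(b"|", 1)
        _client_id, ts, _nonce = body.decode().split("|")
    except ValueError:
        return False  # malformed packet: drop without any response
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest().encode()
    return hmac.compare_digest(tag, expected) and time.time() - int(ts) <= max_age
```

The key property is that an invalid packet produces no reply at all, so a scanner cannot even learn that the controller exists.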
SDP hence turns SDN into a true zero trust network with considerable security benefits. CSA lists a large number of advantages of SDP, ranging from obvious ones like resistance to denial-of-service and man-in-the-middle attacks to less obvious points like “simplified forensics”: Since the whole connection goes over a defined route (SDP controller for AAA, SDP gateway for policy enforcement), the whole session can be logged, monitored and reviewed. If you know access monitoring tools like One Identity (previously Balabit) Safeguard, you will already appreciate the availability of such features when investigating a security incident. The “cloud challenge” mentioned above regarding the need for additional logging still applies for cloud users, however: if the SDP infrastructure is run by the CSP, access logs will most likely not be readily available to the CSP’s clients during a forensic investigation. If the SDP is run by the client itself, however, this very much eases said challenge.
In a recent NIST special publication from August 2020, the definition of a zero trust network summarizes the advantages very well; the Software-defined Perimeter is later described as one possibility to establish a zero trust network:
“Zero trust (ZT) is the term for an evolving set of cybersecurity paradigms that move defenses from static, network-based perimeters to focus on users, assets, and resources. A zero trust architecture (ZTA) uses zero trust principles to plan industrial and enterprise infrastructure and workflows. Zero trust assumes there is no implicit trust granted to assets or user accounts based solely on their physical or network location (i.e., local area networks versus the internet) or based on asset ownership (enterprise or personally owned). Authentication and authorization (both subject and device) are discrete functions performed before a session to an enterprise resource is established. Zero trust is a response to enterprise network trends that include remote users, bring your own device (BYOD), and cloud-based assets that are not located within an enterprise-owned network boundary.”NIST Special Publication 800-207, p. II
Management plane as single-pane of glass
While the management plane as “god-mode tool” can be a risk if hijacked by an attacker (even a higher-level application-control system can be very dangerous in the wrong hands), it should also be noted that it can bring considerable security advantages. The management plane basically serves as a single pane of glass, providing full-stack visibility for the whole infrastructure. Such a unified interface is usually not available in the jungle of different tools and management platforms found in the average company.
Of course, it’s not all roses. The abovementioned challenges are very relevant and, moreover, it’s the same with the cloud as with everything else: it’s not perfect. As cloud providers develop their platforms, they introduce inconsistencies, making the “single pane of glass” a bit less transparent.
Both on-prem and in the cloud, you can create safe environments – but in both cases you need to know what you are doing. Having a look at the Center for Internet Security (CIS) benchmarks and hardened images is certainly a good idea to ensure that your foundations are solid. If you are interested in specific details of where the different cloud providers have their strengths, weaknesses and the occasional inconsistencies, I can recommend this blog post.