Detecting suspicious activity on AWS using cloud logs

AWS offers a broad spectrum of services and compute options. The “shared responsibility” model in cloud presents a simplified split between the organization’s responsibilities and the cloud provider’s responsibilities. Generally, identity and access management (IAM), applications, and data form the dividing line, but the line blurs depending on the cloud service the organization is consuming. This is true of all cloud providers, and the AWS Shared Responsibility Model is no exception.

Deployment mistakes, misconfigurations, use of vulnerable AMIs or container images, and other changes made to AWS service configurations create security problems for organizations, exposing them to possible security incidents or breaches. We’ve seen no shortage of stories about ransomware attacks, privilege escalation, system compromise, data exfiltration, malicious cryptomining, and other negative outcomes.

If you want to delve deeper into the anatomy of cloud attacks, read our GUIDE.

Detecting high-risk events in cloud and container environments is often described as finding a needle in a haystack. While AWS provides some native tools to help, some of which carry additional cost, many organizations suffer from data overload that directly impacts the efficacy of their security program and their ability to respond quickly to security events.

CloudTrail has me covered, right?

CloudTrail is a ubiquitous, fully managed logging service that underpins most AWS service offerings. Actions taken by user identities, machine identities, or other AWS services are recorded as events. The most recent event history (the past 90 days of management events) is stored and visible automagically in CloudTrail. For longer retention periods, though, organizations must configure a Trail (which uses AWS S3 general purpose storage) or a Lake (which uses other AWS managed storage).

These are important distinctions to bear in mind. While CloudTrail is enabled by default and recent event history is a given, most organizations need extended retention to satisfy compliance, maintain extended audit trails, or support security use cases like digital forensics and incident response (DFIR). In some cases, organizations neglect this extra step out of naivety, or deliberately skip it to avoid overloading logs and driving up cloud expenses.

Security best practices for CloudTrail include:

Configure CloudTrail for all organizational AWS accounts and regions.
Encrypt CloudTrail log files at rest.
Enable integrity validation of CloudTrail log files.
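
The three practices above can be checked mechanically. The following is a minimal sketch, assuming each trail is described by a dictionary shaped like an entry in the CloudTrail DescribeTrails response (fields such as IsMultiRegionTrail, IsOrganizationTrail, KmsKeyId, and LogFileValidationEnabled). The audit_trail function and the example trail are hypothetical names used for illustration only. Note that a missing KmsKeyId means the trail relies on default S3-managed encryption rather than a KMS key, which this sketch reports as a gap.

```python
from typing import Any, Dict, List


def audit_trail(trail: Dict[str, Any]) -> List[str]:
    """Return best-practice gaps for one trail description.

    `trail` is assumed to look like one entry of the DescribeTrails
    `trailList`; absent fields are treated as "not enabled".
    """
    findings: List[str] = []
    if not trail.get("IsMultiRegionTrail", False):
        findings.append("trail does not cover all regions")
    if not trail.get("IsOrganizationTrail", False):
        findings.append("trail does not cover all organization accounts")
    if not trail.get("KmsKeyId"):
        findings.append("no KMS key configured for log file encryption")
    if not trail.get("LogFileValidationEnabled", False):
        findings.append("log file integrity validation is disabled")
    return findings


# Hypothetical trail description used only to exercise the check.
EXAMPLE_TRAIL = {
    "Name": "example-trail",
    "IsMultiRegionTrail": True,
    "IsOrganizationTrail": False,
    "LogFileValidationEnabled": True,
}

if __name__ == "__main__":
    for finding in audit_trail(EXAMPLE_TRAIL):
        print(f"{EXAMPLE_TRAIL['Name']}: {finding}")
```

In practice the dictionaries would come from the CloudTrail API or an inventory export; the check itself stays the same.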

As an organization’s architecture within AWS grows and its consumption of AWS services increases, the volume of events and the size of the resulting logs can grow dramatically. This is particularly true as organizations embrace higher levels of automation, adopt microservice architectures, or create API-based designs: machine-to-machine communications skyrocket, and the supporting containerized or serverless compute is much more ephemeral. While some problems that existed in traditional datacenter environments are less of a challenge in cloud, such as strict limitations on storage due to available disk capacity, new problems take their place. Mountains of log data can quickly overwhelm most organizational IT and security teams.

How do you determine which events are actual threats?

Organizations often rely on multiple standards, frameworks, best practices, and regulatory requirements to inform their own secure defaults. A combination of approaches and tooling is used to validate and enforce configurations during design, development, build, and delivery, and then continuously in production. The barrage of common security activities includes IaC scanning, image scanning, infrastructure scanning, cloud posture assessment, runtime profiling, and runtime detection and response.

Determining the actual security risk of an event in production requires adequate baselines to know what should be “normal” for an organization’s environments. Known vulnerabilities (e.g., CVE-IDs), misconfigurations, and threat actors (e.g., threats defined within TI feeds) are certainly a start, but application activity, data access, and identity behaviors are unique for each organization.

Context is important in many security decisions, but it becomes critical for detection and response within production cloud and container environments.

Events and log entries that look risky in a generic environment may be entirely expected in an organization’s unique environments and architectures. As an example, it may be normal to expect AWS S3 bucket creation or deletion in the environment, but this should only hold true when initiated by a privileged user (not a machine identity) and never when originating from a containerized workload. Such activity might also only be expected via the AWS CLI or appropriate API calls from trusted IP address ranges, such as from the organization’s on-premises datacenter or VPN.
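
The S3 example can be expressed as a small context check over a CloudTrail record. This is a sketch under stated assumptions, not a production rule: it uses the standard record fields eventSource, eventName, userIdentity.type, and sourceIPAddress, treats the IAMUser identity type as a stand-in for "a human user", and uses hypothetical trusted ranges drawn from documentation address space. It does not verify that the user is actually privileged or identify containerized workloads, both of which would need extra context.

```python
import ipaddress
from typing import Any, Dict, List

# Hypothetical trusted ranges (documentation networks, not real ones).
TRUSTED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),   # e.g. on-premises datacenter
    ipaddress.ip_network("198.51.100.0/24"),  # e.g. VPN egress
]
BUCKET_EVENTS = {"CreateBucket", "DeleteBucket"}
# Assumption: human operators show up as IAMUser identities.
HUMAN_IDENTITY_TYPES = {"IAMUser"}


def bucket_event_findings(event: Dict[str, Any]) -> List[str]:
    """Return reasons a bucket create/delete event looks unexpected."""
    if event.get("eventSource") != "s3.amazonaws.com":
        return []
    if event.get("eventName") not in BUCKET_EVENTS:
        return []

    findings: List[str] = []
    identity_type = event.get("userIdentity", {}).get("type")
    if identity_type not in HUMAN_IDENTITY_TYPES:
        findings.append(f"unexpected identity type: {identity_type}")

    source = str(event.get("sourceIPAddress", ""))
    try:
        address = ipaddress.ip_address(source)
    except ValueError:
        # The field can hold a service name instead of an IP address.
        findings.append(f"source is not an IP address: {source}")
    else:
        if not any(address in network for network in TRUSTED_NETWORKS):
            findings.append(f"source outside trusted ranges: {source}")
    return findings


if __name__ == "__main__":
    sample = {
        "eventSource": "s3.amazonaws.com",
        "eventName": "DeleteBucket",
        "userIdentity": {"type": "AssumedRole"},
        "sourceIPAddress": "192.0.2.5",
    }
    print(bucket_event_findings(sample))
```

The same event name yields no findings when it comes from an expected identity on a trusted network, which is the point: the event alone is not the signal, the context is.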

CloudTrail captures all events within an AWS environment, but CloudTrail has no concept of safe vs. risky events. CloudTrail also has no inherent alerting capability. Practitioners must engineer around CloudTrail to support their security use cases including alerting, threat detection, forensics, incident response, and threat hunting.

How does stream detection help with threat detection?

Organizations try to detect misconfigurations in their cloud environments with a variety of approaches, each with its own potential pitfalls:

Cloud security posture management (CSPM) – uses a scanning process, such as API polling, at certain intervals to iterate through all service settings in an AWS account. Gathering and analyzing these snapshots to uncover disparities takes time. Polling intervals may be 24 to 36 hours in some cases. If an attacker succeeds in tampering with or exploiting your tenant after a snapshot is taken, the CSPM won’t detect the event until the next polling interval.

Native cloud provider configuration analysis – like CSPM, these options often use a snapshot approach with polling intervals. An example is AWS Security Hub, which exhibits 12-hour latency, leaving a potentially large window of exposure for organizations.

SIEM ingestion and alerting – exports log files to a SIEM, which may consume additional processing time and expense for storing and analyzing logs. The SIEM may already be overloaded with data in the hope that it can still produce meaningful signals for a large spectrum of events beyond cloud and container activity, such as email phishing or ransomware attacks. This approach can suffer from the same window of exposure, and also from alert overload, since all events may appear suspicious. Ingesting cloud and container data at scale almost always exacerbates the problems of slow MTTD and MTTR.

Manual log file analysis or threat hunting – as the name indicates, detection is based purely on the expertise of a security analyst and their ability to unearth meaningful signals from event noise.
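
To make the window of exposure from polling concrete, here is a small worked example. It assumes, purely for illustration, that snapshots run on a fixed schedule at hours 0, N, 2N, and so on, and it ignores the extra time needed to analyze each snapshot. The function name hours_until_next_poll is hypothetical.

```python
import math


def hours_until_next_poll(event_hour: float, interval_hours: float) -> float:
    """Hours between a change and the next scheduled snapshot.

    Assumes snapshots at hours 0, interval, 2 * interval, ...
    A change landing exactly on a snapshot is caught by that snapshot.
    """
    if interval_hours <= 0:
        raise ValueError("interval_hours must be positive")
    next_poll = math.ceil(event_hour / interval_hours) * interval_hours
    return next_poll - event_hour


if __name__ == "__main__":
    # A change made one hour after a snapshot, under the intervals
    # mentioned above (12, 24, and 36 hours).
    for interval in (12, 24, 36):
        delay = hours_until_next_poll(1, interval)
        print(f"{interval}h polling: change seen after {delay}h")
```

Under these assumptions, a change made shortly after a snapshot goes unseen for nearly the full interval, which is the gap a streaming approach is meant to close.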

Effective cloud detection and response capabilities must raise actionable alerts the moment an event indicative of a threat appears in CloudTrail. Such detection capability also shouldn’t add costs that strain security budgets or delays that create unnecessary windows of exposure. The combination of Sysdig for telemetry gathering and Falco as a unifying threat detection engine can power a stream detection approach. Falco can evaluate every CloudTrail entry in real time against a flexible set of security rules. Those rules can alert or take an appropriate responsive action to support the organization’s cybersecurity goals without the delays that are inherent in other approaches.
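
The idea behind stream detection can be sketched in a few lines: every event is checked against a rule set as it arrives, and an alert is produced immediately rather than at the next snapshot. The Python below is only an analogy under stated assumptions. It is not Falco's rule syntax or engine, and the Rule class, the RULES list, and the sample events are hypothetical. The event names StopLogging and DeleteTrail and the Root identity type are standard CloudTrail values.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

Event = Dict[str, Any]


@dataclass(frozen=True)
class Rule:
    """A named condition evaluated against a single event."""
    name: str
    condition: Callable[[Event], bool]


# Hypothetical rule set for illustration.
RULES: List[Rule] = [
    Rule("CloudTrail logging stopped",
         lambda e: e.get("eventName") == "StopLogging"),
    Rule("CloudTrail trail deleted",
         lambda e: e.get("eventName") == "DeleteTrail"),
    Rule("Root account activity",
         lambda e: e.get("userIdentity", {}).get("type") == "Root"),
]


def detect(events: Iterable[Event], rules: List[Rule]) -> Iterator[Tuple[str, Event]]:
    """Yield (rule name, event) as soon as an incoming event matches a rule."""
    for event in events:
        for rule in rules:
            if rule.condition(event):
                yield rule.name, event


# Hypothetical events standing in for a live CloudTrail stream.
SAMPLE_EVENTS: List[Event] = [
    {"eventName": "DescribeInstances", "userIdentity": {"type": "IAMUser"}},
    {"eventName": "StopLogging", "userIdentity": {"type": "AssumedRole"}},
    {"eventName": "ConsoleLogin", "userIdentity": {"type": "Root"}},
]

if __name__ == "__main__":
    for rule_name, event in detect(SAMPLE_EVENTS, RULES):
        print(f"ALERT: {rule_name} ({event['eventName']})")
```

Because detect is a generator, it works the same whether events come from a list or from a long-running source that yields records as they are delivered.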