MAY 12, 2026
EngBrief

Engineering insights from the world's best tech companies, curated and summarized.

Today's dispatch · Editor's pick

Building hybrid multi-tenant architecture for stateful services on AWS

A large-scale ad-serving infrastructure on AWS overcame operational challenges with a hybrid multi-tenant architecture. The previous cellular architecture provided tenant isolation but created scalability, efficiency, and onboarding issues. A new tier-based architecture was designed with cluster-level isolation, using Amazon Route 53 weighted routing and AWS PrivateLink connectivity to improve operational efficiency. This three-level hierarchy allows for independent scaling to address AWS limits, reducing infrastructure setup steps by 80 percent.
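
As a rough illustration of weighted routing between clusters, the sketch below picks a cluster with probability equal to its weight divided by the sum of all weights, which is how weighted record selection generally works. The cluster names and weights are hypothetical, and this is neither Route 53's implementation nor the architecture described in the post.

```python
import random
from typing import Dict, Optional


def expected_share(weights: Dict[str, int]) -> Dict[str, float]:
    """Return the fraction of traffic each cluster should receive."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: weight / total for name, weight in weights.items()}


def pick_cluster(weights: Dict[str, int],
                 rng: Optional[random.Random] = None) -> str:
    """Pick one cluster with probability weight / sum(weights)."""
    rng = rng or random.Random()
    # In this sketch, a cluster with weight 0 is never selected.
    candidates = {name: w for name, w in weights.items() if w > 0}
    if not candidates:
        raise ValueError("at least one weight must be positive")
    names = list(candidates)
    return rng.choices(names, weights=[candidates[n] for n in names], k=1)[0]


# Hypothetical weights: shift 10% of traffic to a newly added cluster.
weights = {"cluster-a": 90, "cluster-b": 10, "cluster-c": 0}
print(expected_share(weights))
print(pick_cluster(weights))
```

Raising the weight of a new cluster in small steps is one way to move tenants gradually while the old cluster keeps serving most requests.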

AWS Architecture · 1 min read · Today · Cloud / Architecture
[Fig. AWS-01: Building hybrid multi-tenant architecture for stateful services on AWS]
Trending this week (3 / 455)
1. Labyrinth 1.1: Making End-to-End Encrypted Backups Even More Reliable · Engineering at Meta · 22h ago
2. Building hybrid multi-tenant architecture for stateful services on AWS · AWS Architecture · 1h ago
3. Choosing between single or multiple organizations in AWS Organizations · AWS Architecture · 19h ago

The Digest

455 articles
AWS · 19h ago

Choosing between single or multiple organizations in AWS Organizations

AWS Organizations provides a centralized way to manage multiple accounts, offering benefits like consolidated billing, simplified governance, and resource sharing. Most customers adopt a single organization, but enterprises may choose multiple organizations if they have independent business units, regulatory requirements, or strong segmentation needs. Multiple organizations provide stronger security isolation and governance flexibility, which suits conglomerates, regulated businesses, and companies with separate leadership and security requirements. A single organization is preferred when teams share a corporate security policy, need centralized compliance enforcement, and want to consolidate billing. The choice depends on balancing operational efficiency with risk isolation.

Cloud / Architecture
1 min
Engineering at Meta · 22h ago

Labyrinth 1.1: Making End-to-End Encrypted Backups Even More Reliable

Labyrinth 1.1 improves the reliability of end-to-end encrypted backups in Messenger by allowing messages to reach the encrypted backup in real time, rather than waiting for the device to come back online. This is achieved through a new sub-protocol that ensures messages survive device loss, device changes, and extended sign-in gaps. The update enhances the security and integrity of encrypted message history, making it more accessible to users across devices.

Social / Scale
1 min
Stripe · 1d ago

Five vertical SaaS insights from Sessions 2026

1. Expanding beyond software is key to differentiation. Vertical SaaS platforms that embed themselves into customers' day-to-day operations through financial services like payments and lending can stay ahead of AI commoditization. Platforms like Toast and GlossGenius have seen significant increases in adoption and revenue growth by making payments a priority.
2. Deepening operations integration builds a stronger moat. By offering a multiproduct strategy, platforms can create a stronger financial moat that makes them harder to displace. Moxie's compliance tools and Slice's wholesale rates on pizza boxes are examples of services that are difficult for new AI-native competitors to offer.
3. Vertical SaaS is finding success offering its own AI products. 87% of SaaS platforms surveyed believe AI is an opportunity, and many are moving from experimentation to monetization. Platforms like Toast IQ, Quipli, and Clio are adapting to customer...

Payments / Infrastructure
1 min
Kiro · 3d ago

More room to explore: $20 paid tier sign-up bonus

Kiro has introduced a $20 sign-up bonus for new paid subscribers, doubling the previous credit limit and providing full model access from day one. The change aims to give developers enough runway to try Kiro before deciding whether it suits them, with access to premium models, including Claude Opus 4.7. The free tier remains unchanged: free users keep access to capable open-weight models like Qwen3 Coder Next and DeepSeek v3.2.

AI / DevTools
1 min
Pinterest · 3d ago

Enhancing Ad Relevance: Integrating Real-Time Context into Sequential Recommender Models

Pinterest engineers integrated real-time context into sequential recommender models to enhance ad relevance, particularly on the Related Pins surface. This was achieved through a new Contextual Sequential Two Tower Model architecture, which incorporates a context layer into the query tower and uses synthetic augmented data to learn from real-time context during offline training. The model demonstrated a 3x to 10x increase in Recall@K and a 275-300% increase in candidate median relevance, resulting in a 0.7% lift in conversion-related ROAS.
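
For readers unfamiliar with the metric, here is a minimal sketch of Recall@K under its common definition: the share of relevant items that appear among the top K retrieved candidates. The summary does not say how Pinterest computes it, so the function name, item IDs, and data below are illustrative assumptions.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant items found in the top-k ranked candidates."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0  # no relevant items, so nothing can be recalled
    top_k = set(ranked[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical retrieval result: candidate IDs ordered by model score.
ranked_ads = ["ad_7", "ad_2", "ad_9", "ad_4"]
relevant_ads = {"ad_7", "ad_4", "ad_1"}

print(recall_at_k(ranked_ads, relevant_ads, k=2))
print(recall_at_k(ranked_ads, relevant_ads, k=4))
```

A larger K can only keep recall the same or raise it, which is why the metric is usually reported at several cutoffs.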

Machine Learning / Data
6 min
Netflix · 3d ago

Scaling ArchUnit with Nebula ArchRules

By John Burns and Emily Yuan. Introduction: At Netflix, we operate using a polyrepo strategy with tens of thousands of Java repositories. This means that we need...

Streaming / Scale
12 min
Kiro · 4d ago

Introducing Kiro Ambassadors

Kiro Ambassadors is a new program that selects engaged developers to collaborate closely with the Kiro team, providing feedback and influencing the product roadmap. In return, ambassadors receive a free Kiro subscription, early access to new features, and direct communication with the product and engineering teams. They commit to sharing their experience and product knowledge through content, events, and feature testing. The program aims to deepen the influence of developers who are active Kiro users, providing a platform for them to shape the product and drive meaningful technical extensions to the community. Ambassadors dedicate around 3-4 hours per month, including a monthly call with the Kiro engineering team and content creation or event participation.

AI / DevTools
1 min
Cloudflare · 4d ago

Building for the future

Cloudflare's leadership, including Matthew Prince and Michelle Zatlyn, announced a significant workforce reduction of over 1,100 employees due to the increased adoption of AI within the company, requiring a reimagining of internal processes and roles. This change is part of Cloudflare's pivot to a high-growth, AI-driven organization, aiming to create value in the "agentic AI era." Cloudflare is providing generous severance packages to departing employees, including full base pay through the end of 2026 and vested equity.

Networking / Security
1 min
The · 4d ago

The Pulse: AI load breaks GitHub – why not other vendors?

GitHub's reliability has decreased significantly, with multiple outages and data integrity issues in recent months. One data integrity incident stemmed from a bug that produced incorrect merge commits with the squash merge method, impacting 2,092 pull requests and requiring customers to manually recover lost commits. GitHub's CTO attributed the reliability woes to a load spike from AI-agent-fuelled requests, which the company is struggling to handle despite a modest 3.5x load increase over two years.

Career / Industry
1 min
Cloudflare · 5d ago

How Cloudflare responded to the “Copy Fail” Linux vulnerability

Cloudflare's Security and Engineering teams quickly assessed the Linux kernel "Copy Fail" vulnerability upon public disclosure on April 29, 2026. They evaluated the exploit technique, checked exposure across their infrastructure, and validated that their existing behavioral detections could identify the exploit pattern within minutes. As a result, there was no impact to the Cloudflare environment, no customer data was at risk, and no services were disrupted at any point. Cloudflare's established procedures meant that, in this case, patches for critical vulnerabilities had already been deployed, allowing the teams to respond proactively.

Networking / Security
1 min
Cloudflare · 5d ago

When DNSSEC goes wrong: how we responded to the .de TLD outage

Cloudflare's public DNS resolver 1.1.1.1 experienced significant outages due to incorrect DNSSEC signatures published by DENIC, the operator of Germany's .de top-level domain (TLD). As a result, Cloudflare returned SERVFAIL for .de-related queries, impacting millions of domains. To mitigate the issue, Cloudflare temporarily treated .de as an insecure zone, bypassing DNSSEC validation, although this left .de domains vulnerable to attacks. Cloudflare's "serve stale" feature also kicked in, continuing to serve cached records and reducing the impact of the outage.
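
The general idea behind serving stale answers can be sketched as a cache that falls back to an expired record when a fresh lookup fails. The class, the `max_stale` window, and the resolver callback below are hypothetical names chosen for illustration; this is not Cloudflare's implementation.

```python
import time
from typing import Callable, Dict, Tuple


class ResolutionError(Exception):
    """Raised by the (hypothetical) upstream resolver when a lookup fails."""


class ServeStaleCache:
    """Cache that may return an expired answer if the upstream lookup fails."""

    def __init__(self, resolve: Callable[[str], str], ttl: float,
                 max_stale: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._resolve = resolve
        self._ttl = ttl              # seconds an answer counts as fresh
        self._max_stale = max_stale  # extra seconds a stale answer may be served
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # name -> (answer, expiry)

    def lookup(self, name: str) -> str:
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]  # still fresh
        try:
            answer = self._resolve(name)
        except ResolutionError:
            # Upstream failed: fall back to the stale answer if it is recent enough.
            if entry is not None and now < entry[1] + self._max_stale:
                return entry[0]
            raise
        self._entries[name] = (answer, now + self._ttl)
        return answer
```

The trade-off is that clients may briefly receive outdated records, which is usually preferable to an outright failure while the upstream problem is being fixed.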

Networking / Security
1 min
Airbnb · 6d ago

Monitoring reliably at scale

Airbnb's observability stack depended on the same systems it was intended to monitor, introducing a circular dependency that risked visibility during outages. To break this dependency, the team isolated compute resources and networking layers to provide redundant, highly available paths for collecting metrics. The team created dedicated Kubernetes clusters for observability workloads to minimize shared failure domains and operational overhead. For networking, they built a custom Layer 7 network ingress layer using Envoy to load-balance traffic, isolate observability traffic, and prioritize telemetry.

Frontend / Data Science
9 min
Slack · 7d ago

From SSH to REST: A Security-Driven Modernization of Slack’s EMR Data Pipelines

To eliminate SSH security risks, Slack's EMR data pipelines were modernized to a REST-based architecture, replacing 700+ SSH-based operators. The key breakthrough was the use of YARN Distributed Shell, which allowed arbitrary shell commands to be executed in YARN containers with resource allocation and lifecycle management, leveraging existing REST APIs. This solution enabled the migration of all SSH-based jobs, including Hadoop workloads and custom shell commands, with zero downtime across 8 data regions.

Collaboration / Infrastructure
1 min
Netflix · 7d ago

Democratizing Machine Learning at Netflix: Building the Model Lifecycle Graph

By Saish Sali, Nipun Kumar, and Sura Elamurugu. Introduction: As Netflix has grown, machine learning continues to support our ability to deliver value to members and...

Streaming / Scale
16 min
Cloudflare · 10d ago

Code Orange: Fail Small is complete. The result is a stronger Cloudflare network

Cloudflare has completed a two-quarter engineering effort called Code Orange: Fail Small, focusing on enhancing infrastructure resiliency, security, and reliability. The initiative has introduced safer configuration changes through Snapstone, a system for gradual rollout and real-time health monitoring, and measures to prevent drift and regressions. Code Orange has also streamlined configuration deployments, strengthened incident management, and established backup authorization pathways to facilitate faster issue resolution.

Networking / Security
1 min
Netflix · 10d ago

State of Routing in Model Serving

By Nipun Kumar, Rajat Shah, and Peter Chng. Introduction: This is the first blog post in a multi-part series that shares technical insights into how our ML model...

Streaming / Scale
15 min
Pinterest · 10d ago

Optimizing ML Workload Network Efficiency (Part I): Feature Trimmer

In Pinterest's online ML serving systems, a root-leaf architecture was optimized to reduce network bandwidth usage. Excessive feature transmission from the root to the leaf had created a network bottleneck, forcing the system to scale based on network usage. lz4 compression reduced root-leaf bandwidth usage by 20%, at the cost of higher CPU usage and latency, but did not solve the underlying problem of shipping unused data. The team therefore developed the "Send What You Use" approach, which trims unnecessary features before transmission, potentially cutting root-leaf network usage by ~50%. The approach uses model signatures to determine the required features, so only necessary data is transmitted between the root and leaf.
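
The core of the trimming idea can be sketched in a few lines: before the root sends a request, it keeps only the features named in the model's signature. The function, feature names, and signature format below are assumptions made for illustration; the summary does not describe Pinterest's actual data structures.

```python
from typing import Any, Dict, Iterable, Mapping


def trim_features(features: Mapping[str, Any],
                  required: Iterable[str]) -> Dict[str, Any]:
    """Keep only the features that the model signature says it reads."""
    required_set = set(required)
    return {name: value for name, value in features.items()
            if name in required_set}


# Hypothetical request payload assembled at the root.
all_features = {
    "user_embedding": [0.1, 0.4, 0.2],
    "pin_category": "home_decor",
    "legacy_counter": 17,      # not read by the model
    "debug_trace": "x" * 256,  # not read by the model
}
# Hypothetical model signature: the input names the model consumes.
model_signature = ["user_embedding", "pin_category"]

payload = trim_features(all_features, model_signature)
print(sorted(payload))
```

Unlike compression, this removes unused data before it is serialized, so it does not add a compress and decompress step on each side of the connection.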

Machine Learning / Data
18 min
Engineering at Meta · 10d ago

How Meta Is Strengthening End-to-End Encrypted Backups

Meta has strengthened its end-to-end encrypted backup system by implementing over-the-air fleet key distribution for Messenger, enabling clients to verify the authenticity of HSM public keys and ensuring secure data storage. This is complemented by the publication of evidence on fleet deployments, providing transparency and proof of secure operations. The system, utilizing tamper-resistant hardware security modules (HSMs), ensures that users' backups remain inaccessible to Meta and third-party providers.

Social / Scale
1 min
Spotify · 10d ago

Building a Natural Language Interface to the Spotify Ads API with Claude Code Plugins

Spotify Engineering developed a Claude Code plugin to provide a natural language interface to the Spotify Ads API. The plugin translates user requests into full Spotify Ads API campaign structures using Markdown files, a bash script, and Python helpers. This allows users to describe campaign requirements in plain English, reducing the cognitive distance between intent and execution.

Music / Scale
1 min