
بروزرسانی: 27 خرداد 1404
Big Data Architecture: A ksqlDB and Kubernetes Tutorial
More precisely, we chose YugabyteDB because it is:
- PostgreSQL-compatible and works with many PostgreSQL database tools such as language drivers, object-relational mapping (ORM) tools, and schema-migration tools.
- Horizontally scalable, where performance scales out simply as nodes are added.
- Resilient and consistent in its data layer.
- Deployable in public clouds, natively with Kubernetes, or on its own managed services.
- 100% open source with powerful enterprise features such as distributed backups, encryption of data at rest, in-flight TLS encryption, change data capture, and read replicas.
Stakeholder
Modern architects must choose between rolling their own platforms using open-source tools or choosing a vendor-provided solution. Infrastructure-as-a-service (IaaS) is required when adopting open-source offerings because IaaS provides the basic components for virtual machines and networking, allowing engineering teams the flexibility to craft their architecture. Alternatively, vendors’ prepackaged solutions and platform-as-a-service (PaaS) offerings remove the need to gather these basic systems and configure the required infrastructure. This convenience, however, comes with a larger price tag.
The editorial team of the Toptal Engineering Blog extends its gratitude to David Prifti and Deepak Agrawal for reviewing the technical content and code samples presented in this article.
Further Reading on the Toptal Engineering Blog:
منبع
In an ideal world, our infrastructure would be installed and configured through IaC. We’d store our entire infrastructure description in Git, write unit tests, use pull requests, and create the whole environment using one click in our continuous integration and continuous deployment tool.
Kubernetes Operators
You’ll recall that our example project has 300 million daily heartbeat events resulting in 100,000 requests per second. This throughput generates a lot of data that is not useful to us in its raw form. We can, however, aggregate it to synthesize our desired final form: For each user, which segments of videos did they watch?
We could use simple and reliable subsystems like RabbitMQ and SQL Server, but our system load numbers exceed the limits of such subsystems’ capabilities. If our business and transaction load grows by 100%, for instance, these single servers would no longer be able to handle the workload. We need horizontally scalable systems for storage and processing, and we as developers must use capable tools—or suffer the consequences.
Description
2020-Present
NoSQL
- Supports transaction-oriented systems, such as accounting or financial applications.
- Requires a high degree of data integrity and security.
- Supports dynamic schemas.
- Allows horizontal scalability.
- Delivers excellent performance with simple queries.
Cloud expansion
Although our design comprises many components, our system is relatively simple in the overall architecture diagram:
We’ll store any data we process in an external database. Kafka Connect allows us to do this easily by acting as a framework to connect Kafka with other databases and external systems, such as key-value stores, search indices, and file systems. If we want to import or export a topic—a “stream” in Kafka parlance—into a database, we don’t need to write any code.
2007-2014
Using this form results in a significantly smaller data storage requirement. To translate the raw data into our desired format, we must first implement real-time stream-processing infrastructure.
Our primary use cases are:
Customers | We don’t need to learn a new language to work with Pulumi. We can use one of our favorites:
The prevalence of SQL databases and batch processing | Kafka itself is just a distributed log system. Traditional Kafka shops use Kafka Streams to implement their stream processing, but we will use ksqlDB, a more advanced tool that subsumes Kafka Streams’ functionality: Infrastructure-as-code (IaC) enables DevOps teams to deploy and manage infrastructure with simple instructions at scale across multiple providers. IaC is a critical best practice of any cloud-development project. So we have roughly 300 million heartbeat events daily and 100,000 requests per second (RPS) at peak times: For more than two decades, few developers and architects dared touch big data systems due to implementation complexities, excessive demands for capable engineers, protracted development times, and the unavailability of key architectural components. With YugabyteDB, we have a perfect match for our architecture, and now we can look at our stream-processing engine. Real-time Stream ProcessingHaving discussed all significant components, we may now examine an overview of our system. Our Architecture With Preferred SystemsCharacterized by | Third-party License Holders | Let’s define hypothetical requirements for a system to demonstrate a big data architecture aimed at a general-purpose application. Say we work for a local video-streaming company. On our platform, we offer localized and original content, and need to track progress functionality for each video a customer watches. The landscape is composed of MapReduce, FTP, mechanical hard drives, and the Internet Information Server. |
---|---|---|---|---|
In the early big data days, developers managed their Kubernetes clusters with raw manifest definitions. Then Helm entered the picture and simplified Kubernetes operations, but there was still room for further optimization. Kubernetes operators came into being and, in concert with Helm, made Kubernetes a technology that developers could quickly put into practice. Big data requires a database. I’ve noticed a trend away from pure relational schemas toward a blend of SQL and NoSQL approaches. SQL and NoSQL DatabasesFocusing on our Kubernetes environment, we can simply install our Kubernetes operators, Strimzi and YugabyteDB, and they will do the rest of the work to install the remaining services. Our overall ecosystem within our Kubernetes environment is as follows: SQL | Use Case | |||
Our chosen product also features attributes that are desirable for any open-source project:
Interestingly, a new class of databases has evolved to cover all significant functionality of the NoSQL and SQL systems. A distinguishing feature of this emergent class is a single logical SQL database that is physically distributed across multiple nodes. While offering no dynamic schema, the new database class boasts these key features:
More knowledgeable teams that approach stream processing tend to choose either the pricier option of AWS Kinesis or the more affordable Apache Spark Structured Streaming. Apache Spark is open source, yet vendor-specific. Since the goal of our architecture is to use open-source components that allow us the flexibility of choosing our hosting partner, we will look at a third, interesting alternative: Kafka in combination with Confluent’s open-source offerings that include schema registry, Kafka Connect, and ksqlDB. Companies may effectively adopt big data systems using a synergy of cloud providers and cloud-native, open-source tools. This combination allows them to build a capable back end with a fraction of the traditional level of complexity. The industry now has acceptable open-source PaaS options free of vendor lock-in. Together, these components allow us to ingest and process the data (for example, group heartbeats into window sessions) and save to the database without writing our own traditional services. Our system can handle any workload because it is distributed and scalable. 2000-2007 | Many smaller teams with no big data experience might approach this translation by implementing microservices subscribed to a message broker, selecting recent events from the database, and then publishing processed data to another queue. Though this approach is simple, it forces the team to handle deduplication, reconnections, ORMs, secrets management, testing, and deployment. Big data architects shift their focus toward high availability, replication, auto-scaling, resharding, load balancing, data encryption, reduced latency, compliance, fault tolerance, and auto-recovery. The use of containers, microservices, and agile processes continues to accelerate. |
Before we choose our specific systems, let’s consider our high-level architecture:
We will deploy all of our stateful and stateless services to Kubernetes. For our stateful services (i.e., YugabyteDB and Kafka), we will use an additional subsystem: Kubernetes operators.
This deployment describes a distributed cloud architecture made simple using today’s technologies. Implementing what was impossible as recently as five years ago may only take only a few hours today.
Modern databases of each type are beginning to implement one another’s features. The differences between SQL and NoSQL offerings are rapidly shrinking, making it more challenging to choose a tool for our architecture. Current database industry rankings indicate that there are nearly 400 databases to choose from.
Distributed SQL Databases
Advertisers require impression metric reports based on user actions.