Resolving the "panic: send on closed channel" Error in ScyllaDB Gemini
This article delves into a specific error encountered while running the Gemini test suite on ScyllaDB, a high-performance NoSQL database. The error, panic: send on closed channel, surfaced after an 8-hour test run and provides valuable insights into potential concurrency issues within the Gemini testing framework or ScyllaDB itself. Understanding the root cause of this panic is crucial for ensuring the stability and reliability of ScyllaDB. This article aims to dissect the error, analyze the context in which it occurred, and discuss potential solutions and preventative measures.
Before diving into the specifics of the error, it's essential to understand the technologies involved. ScyllaDB is a distributed NoSQL database built for high throughput and low latency, often used in applications requiring massive scalability. It is designed to be compatible with Apache Cassandra but offers significant performance improvements due to its architecture, which leverages a shared-nothing approach and asynchronous I/O. Gemini, on the other hand, is a powerful tool used for testing the consistency and performance of distributed databases like ScyllaDB. It simulates real-world workloads and scenarios, helping to identify potential issues before they impact production environments. Gemini works by generating a diverse set of operations (reads, writes, deletes) and verifying that the database maintains data integrity under these conditions.
The error message panic: send on closed channel is a common Go runtime error that indicates an attempt to send data on a channel that has already been closed. In Go, channels are a primary mechanism for communication and synchronization between goroutines (concurrently executing functions). When a channel is closed, it signals that no more data will be sent on it. Attempting to send data to a closed channel results in a panic, which is a critical error that can halt the program's execution. The stack trace provided in the error log offers clues about where the panic originated. Let's break down the key parts of the stack trace:
goroutine 101 [running]:
github.com/scylladb/gemini/pkg/generators.(*Partition).push(...)
/home/runner/work/gemini/gemini/pkg/generators/partition.go:90
github.com/scylladb/gemini/pkg/generators.(*Generator).fillAllPartitions.func2()
/home/runner/work/gemini/gemini/pkg/generators/generator.go:254 +0x273
github.com/scylladb/gemini/pkg/metrics.(*RunningTime).RunFuncE(0xc000105e68, 0xc000105ed0)
/home/runner/work/gemini/gemini/pkg/metrics/metrics.go:290 +0x51
github.com/scylladb/gemini/pkg/generators.(*Generator).fillAllPartitions(0xc0001382c0)
/home/runner/work/gemini/gemini/pkg/generators/generator.go:240 +0x271
github.com/scylladb/gemini/pkg/generators.(*Generator).start.func2()
/home/runner/work/gemini/gemini/pkg/generators/generator.go:177 +0x45
created by github.com/scylladb/gemini/pkg/generators.(*Generator).start in goroutine 1
/home/runner/work/gemini/gemini/pkg/generators/generator.go:171 +0x16a
- github.com/scylladb/gemini/pkg/generators.(*Partition).push(...): This indicates that the panic occurred within the push method of the Partition struct in the generators package of Gemini. This method is likely responsible for sending data to a channel.
- github.com/scylladb/gemini/pkg/generators.(*Generator).fillAllPartitions.func2(): This suggests that the push method was called from within an anonymous function inside the fillAllPartitions method of the Generator struct. This function is probably part of a goroutine that's filling partitions with data.
- github.com/scylladb/gemini/pkg/metrics.(*RunningTime).RunFuncE(...): This line implies that the execution of the function is being monitored by a RunningTime metric, which is used to track performance.
- github.com/scylladb/gemini/pkg/generators.(*Generator).fillAllPartitions(...): This is the method responsible for filling partitions, which are likely data segments within the database.
- github.com/scylladb/gemini/pkg/generators.(*Generator).start.func2(): This is another anonymous function, this time within the start method of the Generator struct. It's likely the entry point for the goroutine that's responsible for filling partitions.
- created by github.com/scylladb/gemini/pkg/generators.(*Generator).start in goroutine 1: This confirms that the goroutine where the panic occurred was started by the start method of the Generator struct.

Taken together, the stack trace gives a clear picture of the execution flow that led to the panic. The Generator is responsible for creating and managing data, and it uses goroutines to fill partitions concurrently. The error suggests that a channel used for communication between these goroutines was closed prematurely, leading to the panic when a goroutine attempted to send data on it.
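To make the failure mode concrete, here is a minimal, self-contained Go sketch (deliberately unrelated to the Gemini codebase) that reproduces the same panic by closing a channel while senders are still running:

```go
package main

import (
	"sync"
	"time"
)

func main() {
	values := make(chan int, 4)
	var wg sync.WaitGroup

	// Several senders, loosely mimicking goroutines that fill partitions.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			time.Sleep(10 * time.Millisecond)
			values <- n // panic: send on closed channel, because close() has already run
		}(i)
	}

	// Bug: the channel is closed before the senders have finished.
	// The safe ordering is wg.Wait() first, and only then close(values).
	close(values)
	wg.Wait()
}
```

In Gemini's case, the same ordering problem would arise if whatever closes the partition channel runs before the goroutines spawned by fillAllPartitions have stopped sending.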
To further understand the context of the error, let's examine the Gemini test configuration used in this run. The command-line arguments provided in the error log offer valuable clues:
gemini --test-cluster="10.4.0.230,10.4.1.192,10.4.2.102,10.4.0.239,10.4.3.180,10.4.3.151" \
--seed=54 \
--schema-seed=54 \
--profiling-port=6060 \
--bind=0.0.0.0:2112 \
--outfile=/gemini_result_7d9b0dc6-4707-443c-ac52-2d9bd17cc4a1.log \
--replication-strategy="{'class': 'NetworkTopologyStrategy', 'replication_factor': '3'}" \
--oracle-replication-strategy="{'class': 'NetworkTopologyStrategy', 'replication_factor': '1'}" \
--oracle-cluster="10.4.0.241" \
--table-options="gc_grace_seconds=60 " \
--level=info \
--request-timeout=3s \
--connect-timeout=60s \
--consistency=QUORUM \
--async-objects-stabilization-backoff=10ms \
--async-objects-stabilization-attempts=10 \
--dataset-size=large \
--oracle-host-selection-policy=token-aware \
--test-host-selection-policy=token-aware \
--drop-schema=true \
--cql-features=normal \
--materialized-views=false \
--use-server-timestamps=true \
--use-lwt=false \
--use-counters=false \
--max-tables=1 \
--max-columns=16 \
--min-columns=8 \
--max-partition-keys=6 \
--min-partition-keys=2 \
--max-clustering-keys=4 \
--min-clustering-keys=2 \
--partition-key-distribution=uniform \
--partition-key-buffer-reuse-size=256 \
--statement-log-file-compression=zstd \
--io-worker-pool=2048 \
--duration 8h \
--warmup 10m \
--concurrency 100 \
--mode mixed \
--max-mutation-retries-backoff 10s \
--max-mutation-retries 60 \
--token-range-slices 20000 \
--max-errors-to-store 1
Key configuration parameters to consider:
- --duration 8h: The test ran for 8 hours, indicating a long-running operation. This increases the likelihood of encountering race conditions or resource exhaustion issues.
- --concurrency 100: Gemini was running with a concurrency of 100, meaning 100 goroutines were simultaneously generating and executing database operations. This high concurrency could exacerbate any underlying synchronization issues.
- --mode mixed: The test was running in mixed mode, which involves a combination of read and write operations. This adds complexity to the test scenario.
- --dataset-size=large: A large dataset size likely means more data was being generated and processed, potentially increasing the load on the system and the likelihood of encountering concurrency issues.
- --table-options="gc_grace_seconds=60": gc_grace_seconds is set to 60 seconds, which might influence how tombstones are handled and could potentially interact with data generation and cleanup processes.
These parameters suggest that the test was designed to heavily load the ScyllaDB cluster, making it more likely to expose any concurrency-related bugs in Gemini or ScyllaDB. The panic: send on closed channel error, in this context, points towards a potential issue in how Gemini manages its internal channels for communication between goroutines, especially under heavy load and long-running tests.
The issue description also provides details about the ScyllaDB cluster used for the test:
- Cluster size: 6 nodes (i4i.2xlarge)
- Scylla version: 2025.1.4-20250707.20afd2776561 with build-id 36ff3bad104de1121fdc101e72233d77111253c1
- Kernel Version: 6.8.0-1031-aws
- OS / Image: ami-0252ea9d9736d854b (aws: undefined_region)
The cluster consisted of 6 nodes, each using the i4i.2xlarge instance type, which is known for its high I/O performance and is well suited to ScyllaDB workloads. The ScyllaDB version being used is 2025.1.4-20250707.20afd2776561, a specific build from July 7th, 2025, and the kernel version is 6.8.0-1031-aws, a relatively recent kernel. These details are important because they help narrow down the scope of the issue. If the error is specific to this ScyllaDB version or kernel, it can guide the investigation towards specific changes or known issues in those components. The fact that the cluster is running on AWS also means that the environment is relatively well-controlled, which can simplify debugging.
Based on the error message, stack trace, and test configuration, here are some potential causes and solutions for the panic: send on closed channel error:
- Premature Channel Closure: The most likely cause is that a channel used for communication between goroutines in Gemini was closed prematurely. This could happen if the goroutine responsible for closing the channel finished its work before other goroutines that were supposed to send data on the channel completed their tasks. To address this, review the code that manages the channels in Gemini, particularly in the generators package, and ensure that channels are only closed after all senders have finished their work. Synchronization mechanisms like sync.WaitGroup can be used to ensure that all goroutines have completed before closing a channel (see the sketch after this list).
- Race Conditions: Race conditions can occur when multiple goroutines access and modify shared resources (like channels) concurrently without proper synchronization. This can lead to unpredictable behavior, including premature channel closure. To prevent race conditions, use synchronization primitives like mutexes or atomic operations to protect shared resources. The Go race detector (the -race flag during compilation or testing) can help identify potential race conditions in the code.
- Error Handling: Inadequate error handling can also lead to this error. If a goroutine encounters an error and exits without properly signaling other goroutines, it might leave a channel in an inconsistent state. Ensure that goroutines handle errors gracefully and communicate failures to other parts of the system; error channels can be used to propagate errors between goroutines.
- Resource Exhaustion: In long-running tests with high concurrency, resource exhaustion (e.g., running out of memory or file descriptors) can lead to unexpected behavior, including premature channel closures. Monitor system resources during the test to identify potential bottlenecks, and adjust resource limits if necessary.
- Bugs in Gemini or ScyllaDB: While less likely, it's possible that the error is caused by a bug in Gemini or ScyllaDB. Review the change logs and issue trackers for both projects to see if similar issues have been reported. If a bug is suspected, consider filing a bug report with detailed information about the error and the steps to reproduce it.
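A common way to rule out premature closure is to give a single goroutine exclusive ownership of the channel and have it close the channel only after a sync.WaitGroup confirms that every producer has returned. The following is a minimal sketch of that pattern under generic assumptions; it is not a patch for Gemini:

```go
package main

import (
	"fmt"
	"sync"
)

// produce sends count values on out. It never closes out, because producers
// do not own the channel.
func produce(out chan<- int, id, count int, wg *sync.WaitGroup) {
	defer wg.Done()
	for i := 0; i < count; i++ {
		out <- id*100 + i
	}
}

func main() {
	out := make(chan int, 16)
	var wg sync.WaitGroup

	// Start the producers.
	for id := 0; id < 4; id++ {
		wg.Add(1)
		go produce(out, id, 5, &wg)
	}

	// A single owner closes the channel, and only after every producer is done.
	go func() {
		wg.Wait()
		close(out)
	}()

	// The consumer drains the channel until it is closed; no send can now
	// race with the close.
	for v := range out {
		fmt.Println(v)
	}
}
```

The key design choice is that close is reachable only through wg.Wait(), so no code path can close the channel while a sender is still active.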
To effectively debug the issue, it's essential to have a reliable way to reproduce it. Here are some steps to try:
- Run the Same Test Configuration: Use the exact same Gemini command-line arguments and ScyllaDB cluster configuration as in the original test run. This ensures that the test environment is consistent.
- Reduce Concurrency: Try reducing the --concurrency value to see if the error still occurs. A lower concurrency might make the issue less frequent, but it can help isolate the problem.
- Increase Duration: If the error is infrequent, try increasing the --duration to improve the chances of reproducing it.
- Enable Debug Logging: Add more logging to the Gemini code, particularly around channel operations in the generators package. This can provide more insight into the state of the channels and goroutines (a sketch of one such instrumentation approach follows this list).
- Use the Go Debugger (Delve): Use a debugger like Delve to step through the Gemini code and inspect the state of channels and goroutines when the panic occurs. This can help pinpoint the exact line of code that's causing the issue.
- Analyze Core Dumps: If ScyllaDB or Gemini generates a core dump when the panic occurs, analyze the core dump using tools like GDB to understand the state of the system at the time of the crash.
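As a temporary debugging aid, a send site can be wrapped so that the panic is caught and logged with context instead of crashing the whole run, making it easier to see which goroutine hit the closed channel and when. This is a generic sketch of such instrumentation, not code from the Gemini repository; the trySend helper and its parameters are hypothetical:

```go
package main

import "log"

// trySend attempts to send v on ch and reports, rather than crashes, when the
// channel has already been closed. This is a hypothetical debugging helper,
// not something that exists in Gemini.
func trySend(ch chan<- int, v int, who string) (ok bool) {
	defer func() {
		if r := recover(); r != nil {
			// A send on a closed channel panics; recover and log the culprit.
			log.Printf("%s: send after close: %v", who, r)
			ok = false
		}
	}()
	ch <- v
	return true
}

func main() {
	ch := make(chan int, 1)
	close(ch)

	// Instead of crashing the run, the failed send is logged with context.
	if !trySend(ch, 42, "partition-filler-1") {
		log.Println("partition-filler-1 observed a closed channel and stopped")
	}
}
```

Instrumentation like this should only be used while hunting the bug; the real fix remains closing the channel at the right point, as discussed above.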
The impact of the panic: send on closed channel error is significant because it can halt the test run and prevent Gemini from completing its validation of ScyllaDB, which in turn can delay the release of new ScyllaDB versions or features. The frequency of the error is not explicitly stated in the issue description, but the fact that it occurred after 8 hours of testing suggests that it might be an infrequent issue triggered by specific timing conditions or race conditions.
The panic: send on closed channel error encountered during the Gemini test run highlights the complexities of concurrent programming and the importance of proper channel management in Go. By analyzing the error message, stack trace, test configuration, and ScyllaDB cluster details, we can identify potential causes and solutions. The most likely cause is a premature channel closure due to race conditions or inadequate synchronization between goroutines in Gemini. To resolve the issue, it's crucial to review the Gemini code, particularly in the generators package, and ensure that channels are closed correctly and that shared resources are protected with appropriate synchronization mechanisms. Reproducing the issue reliably and using debugging tools like Delve and core dump analysis can help pinpoint the exact cause and lead to a robust solution. Addressing this error will improve the stability and reliability of the Gemini testing framework and contribute to the overall quality of ScyllaDB.