
proposal: testing/synctest: new package for testing concurrent code #67434

Open
neild opened this issue May 16, 2024 · 142 comments
@neild
Contributor

neild commented May 16, 2024

Current proposal status: #67434 (comment)


This is a proposal for a new package to aid in testing concurrent code.

// Package synctest provides support for testing concurrent code.
package synctest

// Run executes f in a new goroutine.
//
// The new goroutine and any goroutines transitively started by it form a group.
// Run waits for all goroutines in the group to exit before returning.
//
// Goroutines in the group use a synthetic time implementation.
// The initial time is midnight UTC 2000-01-01.
// Time advances when every goroutine is idle.
// If every goroutine is idle and there are no timers scheduled,
// Run panics.
func Run(f func())

// Wait blocks until every goroutine within the current group is idle.
//
// A goroutine is idle if it is blocked on a channel operation,
// mutex operation,
// time.Sleep,
// a select with no cases,
// or is the goroutine calling Wait.
//
// A goroutine blocked on an I/O operation, such as a read from a network connection,
// is not idle. Tests which operate on a net.Conn or similar type should use an
// in-memory implementation rather than a real network connection.
//
// The caller of Wait must be in a goroutine created by Run,
// or a goroutine transitively started by Run.
// If it is not, Wait panics.
func Wait()

This package has two main features:

  1. It permits using a fake clock to test code which uses timers. The test can control the passage of time as observed by the code under test.
  2. It permits a test to wait until an asynchronous operation has completed.

As an example, let us say we are testing an expiring concurrent cache:

type Cache[K comparable, V any] struct{}

// NewCache creates a new cache with the given expiry.
// f is called to create new items as necessary.
func NewCache[K comparable, V any](expiry time.Duration, f func(K) V) *Cache[K, V]

// Get returns the cache entry for K, creating it if necessary.
func (c *Cache[K, V]) Get(key K) V

A naive test for this cache might look something like this:

func TestCacheEntryExpires(t *testing.T) {
	count := 0
	c := NewCache(2*time.Second, func(key string) string {
		count++
		return fmt.Sprintf("%v:%v", key, count)
	})

	// Get an entry from the cache.
	if got, want := c.Get("k"), "k:1"; got != want {
		t.Errorf("c.Get(k) = %q, want %q", got, want)
	}

	// Verify that we get the same entry when accessing it before the expiry.
	time.Sleep(1 * time.Second)
	if got, want := c.Get("k"), "k:1"; got != want {
		t.Errorf("c.Get(k) = %q, want %q", got, want)
	}

	// Wait for the entry to expire and verify that we now get a new one.
	time.Sleep(3 * time.Second)
	if got, want := c.Get("k"), "k:2"; got != want {
		t.Errorf("c.Get(k) = %q, want %q", got, want)
	}
}

This test has a couple problems. It's slow, taking four seconds to execute. And it's flaky, because it assumes the cache entry will not have expired one second before its deadline and will have expired one second after. While computers are fast, it is not uncommon for an overloaded CI system to pause execution of a program for longer than a second.

We can make the test less flaky by making it slower, or we can make the test faster at the expense of making it flakier, but we can't make it fast and reliable using this approach.

We can design our Cache type to be more testable. We can inject a fake clock to give us control over time in tests. When advancing the fake clock, we will need some mechanism to ensure that any timers that fire have executed before progressing the test. These changes come at the expense of additional code complexity: We can no longer use time.Timer, but must use a testable wrapper. Background goroutines need additional synchronization points.

The synctest package simplifies all of this. Using synctest, we can write:

func TestCacheEntryExpires(t *testing.T) {
        synctest.Run(func() {
                count := 0
                c := NewCache(2*time.Second, func(key string) string {
                        count++
                        return fmt.Sprintf("%v:%v", key, count)
                })

                // Get an entry from the cache.
                if got, want := c.Get("k"), "k:1"; got != want {
                        t.Errorf("c.Get(k) = %q, want %q", got, want)
                }

                // Verify that we get the same entry when accessing it before the expiry.
                time.Sleep(1 * time.Second)
                synctest.Wait()
                if got, want := c.Get("k"), "k:1"; got != want {
                        t.Errorf("c.Get(k) = %q, want %q", got, want)
                }

                // Wait for the entry to expire and verify that we now get a new one.
                time.Sleep(3 * time.Second)
                synctest.Wait()
                if got, want := c.Get("k"), "k:2"; got != want {
                        t.Errorf("c.Get(k) = %q, want %q", got, want)
                }
        })
}

This is identical to the naive test above, wrapped in synctest.Run and with the addition of two calls to synctest.Wait. However:

  1. This test is not slow. The time.Sleep calls use a fake clock, and execute immediately.
  2. This test is not flaky. The synctest.Wait ensures that all background goroutines have idled or exited before the test proceeds.
  3. This test requires no additional instrumentation of the code under test. It can use standard time package timers, and it does not need to provide any mechanism for tests to synchronize with it.

A limitation of the synctest.Wait function is that it does not recognize goroutines blocked on network or other I/O operations as idle. While the scheduler can identify a goroutine blocked on I/O, it cannot distinguish between a goroutine that is genuinely blocked and one which is about to receive data from a kernel network buffer. For example, if a test creates a loopback TCP connection, starts a goroutine reading from one side of the connection, and then writes to the other, the read goroutine may remain in I/O wait for a brief time before the kernel indicates that the connection has become readable. If synctest.Wait considered a goroutine in I/O wait to be idle, this would cause nondeterminism in cases such as this.

Tests which use synctest with network connections or other external data sources should use a fake implementation with deterministic behavior. For net.Conn, net.Pipe can create a suitable in-memory connection.

This proposal is based in part on experience with tests in the golang.org/x/net/http2 package. Tests of an HTTP client or server often involve multiple interacting goroutines and timers. For example, a client request may involve goroutines writing to the server, reading from the server, and reading from the request body; as well as timers covering various stages of the request process. The combination of fake clocks and an operation which waits for all goroutines in the test to stabilize has proven effective.

@gabyhelp's overview of this issue: #67434 (comment)

@aclements
Member

I really like how simple this API is.

Time advances when every goroutine is idle.

How does time work when goroutines aren't idle? Does it stand still, or does it advance at the usual rate? If it stands still, it seems like that could break software that assumes time will advance during computation (though maybe that's rare in practice). If it advances at the usual rate, it seems like that reintroduces a source of flakiness. E.g., in your example, the 1 second sleep will advance time by 1 second, but then on a slow system the checking thread may still not execute for a long time.

What are the bounds of the fake time implementation? Presumably if you're making direct system calls that interact with times or durations, we're not going to do anything about that. Are we going to make any attempt at faking time in the file system?

If every goroutine is idle and there are no timers scheduled, Run panics.

What if a goroutine is blocked on a channel that goes outside the group? This came to mind in the context of whether this could be used to coordinate a multi-process client/server test, though I think it would also come up if there's any sort of interaction with a background worker goroutine or pool.

or is the goroutine calling Wait.

What happens if multiple goroutines in a group call Wait? I think the options are to panic or to consider all of them idle, in which case they would all wake up when every other goroutine in the group is idle.

What happens if you have nested groups, say group A contains group B, and a goroutine in B is blocked in Wait, and then a goroutine in A calls Wait? I think your options are to panic (though that feels wrong), wake up both if all of the goroutines in group A are idle, or wake up just B if all of the goroutines in B are idle (but this blocks waking up A until nothing is calling Wait in group B).

@neild
Contributor Author

neild commented May 16, 2024

How does time work when goroutines aren't idle?

Time stands still, except when all goroutines in a group are idle. (Same as the playground behaves, I believe.) This would break software that assumes time will advance. You'd need to use something else to test that case.

What are the bounds of the fake time implementation?

The time package: Now, Since, Sleep, Timer, Ticker, etc.

Faking time in the filesystem seems complicated and highly specialized, so I don't think we should try. Code which cares about file timestamps will need to use a test fs.FS or some such.

What if a goroutine is blocked on a channel that goes outside the group?

As proposed, this would count as an idle goroutine. If you fail to isolate the system under test this will probably cause problems, so don't do that.

What happens if multiple goroutines in a group call Wait?

As proposed, none of them ever wake up and your test times out, or possibly panics if we can detect that all goroutines are blocked in that case. Having them all wake at the same time would also be reasonable.

What happens if you have nested groups

Oh, I didn't think of that. Nested groups are too complicated, Run should panic if called from within a group.

@apparentlymart

This is a very interesting proposal!

I feel worried that the synctest.Run characteristic of establishing a "goroutine group" and blocking until it completes might make it an attractive nuisance for folks who see it as simpler than arranging for the orderly completion of many goroutines using other synchronization primitives. That is: people may be tempted to use it in non-test code.

Assuming that's a valid concern (if it isn't then I'll retract this entire comment!), I could imagine mitigating it in two different ways:

  • Offer "goroutine groups" as a standalone synchronization primitive that synctest.Run is implemented in terms of, offering the "wait for completion of this and any other related goroutines" mechanism as a feature separate from synthetic time. Those who want to use it in non-test code can therefore use the lower-level function directly, instead of using synctest.Run.
  • Change the synctest.Run design in some way that makes it harder to misuse. One possible idea: make synctest.Run take a testing.TB as an additional argument, and then in every case where the proposal currently calls for a panic use t.FailNow() instead. It's inconvenient (though of course not impossible) to obtain a testing.TB implementation outside of a test case or benchmark, which could be sufficient inconvenience for someone to reconsider what they were attempting.

(I apologize in advance if I misunderstood any part of the proposal or if I am missing something existing that's already similarly convenient to synctest.Run.)

@neild
Contributor Author

neild commented May 17, 2024

The fact that synctest goroutine groups always use a fake clock will hopefully act as discouragement to using them in non-test code. Defining goroutines blocked on I/O as not being idle also discourages use outside of tests; any goroutine reading from a network connection defeats synctest.Wait entirely.

I think using idle-wait synchronization outside of tests is always going to be a mistake. It's fragile and fiddly, and you're better served by explicit synchronization. (This prompts the question: Isn't this fragile and fiddly inside tests as well? It is, but using a fake clock removes much of the sources of fragility, and tests often have requirements that make the fiddliness a more worthwhile tradeoff. In the expiring cache example, for example, non-test code will never need to guarantee that a cache entry expires precisely at the nanosecond defined.)

So while perhaps we could offer a standalone synchronization primitive outside of synctest, I think we would need a very good understanding of when it would be appropriate to use it.

As for passing a testing.TB to synctest.Run, I don't think this would do much to prevent misuse, since the caller could just pass a &testing.T{}, or just nil. I don't think it would be wise to use synctest outside of tests, but if someone disagrees, then I don't think it's worth trying to stop them.

@gh2o

gh2o commented May 18, 2024

Interesting proposal. I like that it allows for waiting for a group of goroutines, as opposed to all goroutines in my proposal (#65336), though I do have some concerns:

  • Complexity of implementation: Having to modify every time-related function may increase complexity for non-test code. Would it make more sense to outsource the mock time implementation to a third party library? The Wait() function should be sufficient for the third party library to function deterministically, and goroutines started by Run() would behave like normal goroutines in all aspects.

  • Timeouts: In my proposal, WaitIdle() returns a <-chan struct{} since it allows for a test harness to abort the test if it takes too long (e.g. 30 seconds in case the test gets stuck in an infinite loop). Would it make sense for the Wait() function here to return a chan too to allow for timeouts?

@neild
Copy link
Contributor Author

neild commented May 18, 2024

One of the goals of this proposal is to minimize the amount of unnatural code required to make a system testable. Mock time implementations require replacing calls to idiomatic time package functions with a testable interface. Putting fake time in the standard library would let us just write the idiomatic code without compromising testability.

For timeouts, the -timeout test flag allows aborting too-slow tests. Putting an explicit timeout in test code is usually a bad idea, because how long a test is expected to run is a property of the local system. (I've seen a lot of tests inside Google which set an explicit timeout of 5 or 10 seconds, and then experience flakiness when run with -tsan and on CI systems that execute at a low batch priority.)

Also, it would be pointless for Wait to return a <-chan struct{}, because Wait must be called from within a synctest group and therefore the caller doesn't have access to a real clock.

@ianlancetaylor ianlancetaylor moved this to Incoming in Proposals May 18, 2024
@neild
Contributor Author

neild commented May 22, 2024

I wanted to evaluate practical usage of the proposed API.

I wrote a version of Run and Wait based on parsing the output of runtime.Stack. Wait calls runtime.Gosched in a loop until all goroutines in the current group are idle.

I also wrote a fake time implementation.

Combined, these form a reasonable facsimile of the proposed synctest package, with some limitations: The code under test needs to be instrumented to call the fake time functions, and to call a marking function after creating new goroutines. Also, you need to call a synctest.Sleep function in tests to advance the fake clock.

I then added this instrumentation to net/http.

The synctest package does not work with real network connections, so I added an in-memory net.Conn implementation to the net/http tests.

I also added an additional helper to net/http's tests, which simplifies some of the experimentation below:

var errStillRunning = errors.New("async op still running")

// asyncResult is the result of an asynchronous operation.
type asyncResult[T any] struct {}

// runAsync runs f in a new goroutine,
// and returns an asyncResult which is populated with the result of f when it finishes.
// runAsync calls synctest.Wait after running f.
func runAsync[T any](f func() (T, error)) *asyncResult[T]

// done reports whether the asynchronous operation has finished.
func (r *asyncResult[T]) done() bool

// result returns the result of the asynchronous operation.
// It returns errStillRunning if the operation is still running.
func (r *asyncResult[T]) result() (T, error)

One of the longest-running tests in the net/http package is TestServerShutdownStateNew (https://go.googlesource.com/go/+/refs/tags/go1.22.3/src/net/http/serve_test.go#5611). This test creates a server, opens a connection to it, and calls Server.Shutdown. It asserts that the server, which is expected to wait 5 seconds for the idle connection to close, shuts down in no less than 2.5 seconds and no more than 7.5 seconds. This test generally takes about 5-6 seconds to run in both HTTP/1 and HTTP/2 modes.

The portion of this test which performs the shutdown is:

shutdownRes := make(chan error, 1)
go func() {
	shutdownRes <- ts.Config.Shutdown(context.Background())
}()
readRes := make(chan error, 1)
go func() {
	_, err := c.Read([]byte{0})
	readRes <- err
}()

// TODO(#59037): This timeout is hard-coded in closeIdleConnections.
// It is undocumented, and some users may find it surprising.
// Either document it, or switch to a less surprising behavior.
const expectTimeout = 5 * time.Second

t0 := time.Now()
select {
case got := <-shutdownRes:
	d := time.Since(t0)
	if got != nil {
		t.Fatalf("shutdown error after %v: %v", d, got)
	}
	if d < expectTimeout/2 {
		t.Errorf("shutdown too soon after %v", d)
	}
case <-time.After(expectTimeout * 3 / 2):
	t.Fatalf("timeout waiting for shutdown")
}

// Wait for c.Read to unblock; should be already done at this point,
// or within a few milliseconds.
if err := <-readRes; err == nil {
	t.Error("expected error from Read")
}

I wrapped the test in a synctest.Run call and changed it to use the in-memory connection. I then rewrote this section of the test:

shutdownRes := runAsync(func() (struct{}, error) {
	return struct{}{}, ts.Config.Shutdown(context.Background())
})
readRes := runAsync(func() (int, error) {
	return c.Read([]byte{0})
})

// TODO(#59037): This timeout is hard-coded in closeIdleConnections.
// It is undocumented, and some users may find it surprising.
// Either document it, or switch to a less surprising behavior.
const expectTimeout = 5 * time.Second

synctest.Sleep(expectTimeout - 1)
if shutdownRes.done() {
	t.Fatal("shutdown too soon")
}

synctest.Sleep(2 * time.Second)
if _, err := shutdownRes.result(); err != nil {
	t.Fatalf("Shutdown() = %v, want complete", err)
}
if n, err := readRes.result(); err == nil || err == errStillRunning {
	t.Fatalf("Read() = %v, %v; want error", n, err)
}

The test exercises the same behavior it did before, but it now runs instantaneously. (0.01 seconds on my laptop.)

I made an interesting discovery after converting the test: The server does not actually shut down in 5 seconds. In the initial version of this test, I checked for shutdown exactly 5 seconds after calling Shutdown. The test failed, reporting that the Shutdown call had not completed.

Examining the Shutdown function revealed that the server polls for closed connections during shutdown, with a maximum poll interval of 500ms, and therefore shutdown can be delayed slightly past the point where connections have shut down.

I changed the test to check for shutdown after 6 seconds. But once again, the test failed.

Further investigation revealed this code (https://go.googlesource.com/go/+/refs/tags/go1.22.3/src/net/http/server.go#3041):

st, unixSec := c.getState()
// Issue 22682: treat StateNew connections as if
// they're idle if we haven't read the first request's
// header in over 5 seconds.
if st == StateNew && unixSec < time.Now().Unix()-5 {
	st = StateIdle
}

The comment states that new connections are considered idle for 5 seconds, but thanks to the low granularity of Unix timestamps the test can consider one idle for as little as 4 or as much as 6 seconds. Combined with the 500ms poll interval (and ignoring any added scheduler delay), Shutdown may take up to 6.5 seconds to complete, not 5.

Using a fake clock rather than a real one not only speeds up this test dramatically, but it also allows us to more precisely test the behavior of the system under test.


Another slow test is TestTransportExpect100Continue (https://go.googlesource.com/go/+/refs/tags/go1.22.3/src/net/http/transport_test.go#1188). This test sends an HTTP request containing an "Expect: 100-continue" header, which indicates that the client is waiting for the server to indicate that it wants the request body before it sends it. In one variation, the server does not send a response; after a 2 second timeout, the client gives up waiting and sends the request.

This test takes 2 seconds to execute, thanks to this timeout. In addition, the test does not validate the timing of the client sending the request body; in particular, tests pass even if the client waits for the wrong amount of time before sending it.

The portion of the test which sends the request is:

resp, err := c.Do(req)

I changed this to:

rt := runAsync(func() (*Response, error) {
	return c.Do(req)
})
if v.timeout {
	synctest.Sleep(expectContinueTimeout-1)
	if rt.done() {
		t.Fatalf("RoundTrip finished too soon")
	}
	synctest.Sleep(1)
}
resp, err := rt.result()
if err != nil {
	t.Fatal(err)
}

This test now executes instantaneously. It also verifies that the client does or does not wait for the ExpectContinueTimeout as expected.

I made one discovery while converting this test. The synctest.Run function blocks until all goroutines in the group have exited. (In the proposed synctest package, Run will panic if all goroutines become blocked (deadlock), but I have not implemented that feature in the test version of the package.) The test was hanging in Run, due to leaking a goroutine. I tracked this down to a missing net.Conn.Close call, which was leaving an HTTP client reading indefinitely from an idle and abandoned server connection.

In this case, Run's behavior caused me some confusion, but ultimately led to the discovery of a real (if fairly minor) bug in the test. (I'd probably have experienced less confusion, but I initially assumed this was a bug in the implementation of Run.)


At one point during this exercise, I accidentally called testing.T.Run from within a synctest.Run group. This results in, at the very best, quite confusing behavior. I think we would want to make it possible to detect when running within a group, and have testing.T.Run panic in this case.


My experimental implementation of the synctest package includes a synctest.Sleep function by necessity: It was much easier to implement with an explicit call to advance the fake clock. However, I found in writing these tests that I often want to sleep and then wait for any timers to finish executing before continuing.

I think, therefore, that we should have one additional convenience function:

package synctest

// Sleep pauses the current goroutine for the duration d,
// and then blocks until every goroutine in the current group is idle.
// It is identical to calling time.Sleep(d) followed by Wait.
//
// The caller of Sleep must be in a goroutine created by Run,
// or a goroutine transitively started by Run.
// If it is not, Sleep panics.
func Sleep(d time.Duration) {
	time.Sleep(d)
	Wait()
}

The net/http package was not designed to support testing with a fake clock. This has served as an obstacle to improving the state of the package's tests, many of which are slow, flaky, or both.

Converting net/http to be testable with my experimental version of synctest required a small number of minor changes. A runtime-supported synctest would have required no changes at all to net/http itself.

Converting net/http tests to use synctest required adding an in-memory net.Conn. (I didn't attempt to use net.Pipe, because its fully-synchronous behavior tends to cause problems in tests.) Aside from this, the changes required were very small.


My experiment is in https://go.dev/cl/587657.

@rsc
Contributor

rsc commented May 23, 2024

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@gh2o

gh2o commented May 29, 2024

Commenting here due to @rsc's request:

Relative to my proposal #65336, I have the following concerns:

  • Goroutine grouping: the only precedent for a goroutine having a user-visible identity is runtime.LockOSThread(), and even then, it is set-only: a goroutine cannot know whether it is locked to a thread without parsing runtime.Stack() output. Having these special "test mode" goroutines feels like a violation of goroutines being interchangeable anonymous workers, insofar as the Go runtime hides the goroutine ID from user code. Having a global wait is acceptable in the case of tests, since it is unlikely for background goroutines to be present to interfere with the wait (and it is possibly even desirable to catch those too).
  • Overriding standard library behavior: again, there is no precedent for standard library functions to behave differently based on what goroutine they are called from. The standard idiomatic way to do this is to define an interface (e.g. fs.FS) and direct all calls through the interface, and the implementation of the interface can be mocked at test time. If it is desirable to keep the current Run()/Wait() API, I would still strongly advocate for not changing the behavior of the standard time package, and instead incorporate a mock clock implementation in another package (likely under testing).

@neild
Contributor Author

neild commented May 29, 2024

Regarding overriding the time package vs. providing a testing implementation:

The time package provides a number of commonly-used, exported functions, where code that makes direct use of these functions cannot be properly tested. I think this makes it unique in the standard library. For example, code which directly calls time.Sleep cannot be tested properly, because inserting a real delay inherently makes a test slow, and because there is no non-flaky way for a test to ensure that a certain amount of time has elapsed.

In contrast, we can test code which calls os.Open by providing it with the name of a file in a test directory. We can test code which calls net.Listen by listening on a loopback interface. The io/fs.FS interface may be used to create a testable seam in a system, but it isn't required.

Time is fundamentally different in that there is no way to use real time in a test without making the test flaky and slow.

Time is also different from an fs.File or a net.Conn in that there is only one plausible production implementation of time. A fs.FS might be the local filesystem, or an embedded set of static files, or a remote filesystem of some kind. A net.Conn might be a TCP or TLS connection. But it is difficult to come up with occasions outside of tests when time.Sleep should do anything other than sleep for the defined amount of time.

Since we can't use real time in tests, we can insert a testable wrapper around the time package as you propose. This requires that we avoid the idiomatic and easy-to-use time package functions. We essentially put an asterisk next to every existing function in the time package that deals with the system clock saying, "don't actually use this, or at least not in code you intend to test".

In addition, if we define a standard testable wrapper around the clock, we are essentially declaring that all public packages which deal with time should provide a way to plumb in a clock. (Some packages do this already, of course; crypto/tls.Config.Time is an example in std).

That's an option, of course. But it would be a very large change to the Go ecosystem as a whole.

@DmitriyMV
Contributor

DmitriyMV commented May 29, 2024

the only precedent for goroutine having a user-visible identity is runtime.LockOSThread()

The pprof.SetGoroutineLabels disagrees.

insofar as the Go runtime hides the goroutine ID from user code

It doesn't try to hide it, more like tries to restrict people from relying on numbers.

Having a global wait is acceptable in the case of tests since it is unlikely for background goroutines to be present to interfere with the wait (and possibly actually desirable to catch those too).

If I understood the proposal correctly, it will wait for any goroutine started (transitively) via a go statement from the func passed to Run. It will not catch anything started before it or alongside it. Which raises a good question: @neild, will it also wait for time.AfterFunc(...) goroutines if time.AfterFunc(...) was called in the chain leading to synctest.Run?

@neild
Contributor Author

neild commented May 29, 2024

@neild will it also wait for time.AfterFunc(...) goroutines if time.AfterFunc(...) was called in the chain leading to synctest.Run?

Yes, if you call AfterFunc from within a synctest group then the goroutine started by AfterFunc is also in the group.

@gh2o

gh2o commented May 30, 2024

Given that there's more precedent for goroutine identity than I had previously thought, and seeing how pprof.Do() works, I am onboard with the idea of goroutine groups.

However, I'm still a little ambivalent about goroutine groups affecting time package / standard library behavior, and theoretically a test running in synctest mode may want to know the real world time for logging purposes (I guess that could be solved by adding a time.RealNow() or something similar). The Wait() primitive seems to provide what is necessary for a third-party package to provide the same functionality without additional runtime support, so it could be worth exploring this option a bit more.

That being said, I agree that plumbing a time/clock interface through existing code is indeed tedious, and having time modified to conditionally use a mock timer may be the lesser evil. But it still feels a little icky to me for some reason.

@aclements
Member

Thanks for doing the experiment. I find the results pretty compelling.

I think, therefore, that we should have one additional convenience function: [synctest.Sleep]

I don't quite understand this function. Given the fake time implementation, if you sleep even a nanosecond past timer expiry, aren't you already guaranteed that those timers will have run because the fake time won't advance to your sleep deadline until everything is blocked again?

Nested groups are too complicated, Run should panic if called from within a group.

Partly I was wondering about nested groups because I've been scheming other things that the concept of a goroutine group could be used for. Though it's true that, even if we have groups for other purposes, it may make sense to say that synctest groups cannot be nested, even if in general groups can be nested.

@neild
Copy link
Contributor Author

neild commented May 30, 2024

Given the fake time implementation, if you sleep even a nanosecond past timer expiry, aren't you already guaranteed that those timers will have run because the fake time won't advance to your sleep deadline until everything is blocked again?

You're right that sleeping past the deadline of a timer is sufficient. The synctest.Wait function isn't strictly necessary at all; you could use time.Sleep(1) to skip ahead a nanosecond and ensure all currently running goroutines have parked.

It's fairly natural to sleep to the exact instant of a timer, however. If a cache entry expires in some amount of time, it's easy to sleep for that exact amount of time, possibly using the same constant that the cache timeout was initialized with, rather than adding a nanosecond.

Adding nanoseconds also adds a small but real amount of confusion to a test in various small ways: The time of logged events drifts off the integer second, rate calculations don't come out as cleanly, and so on.

Plus, if you forget to add the necessary adjustment or otherwise accidentally sleep directly onto the instant of a timer's expiry, you get a race condition.

Cleaner, I think, for the test code to always resynchronize after poking the system under test. This doesn't have to be a function in the synctest package, of course; synctest.Sleep is a trivial two-liner using exported APIs. But I suspect most users of the package would use it, or at least the ones that make use of the synthetic clock.

I've been scheming other things that the concept of a goroutine group could be used for.

I'm very intrigued! I've just about convinced myself that there's a useful general purpose synchronization API hiding in here, but I'm not sure what it is or what it's useful for.

@rsc
Copy link
Contributor

rsc commented Jun 5, 2024

For what it's worth, I think it's a good thing that virtual time is included in this, because it makes sure that this package isn't used in production settings. It makes it only suitable for tests (and very suitable).

@rsc
Copy link
Contributor

rsc commented Jun 5, 2024

It sounds like the API is still:

// Package synctest provides support for testing concurrent code.
package synctest

// Run executes f in a new goroutine.
//
// The new goroutine and any goroutines transitively started by it form a group.
// Run waits for all goroutines in the group to exit before returning.
//
// Goroutines in the group use a synthetic time implementation.
// The initial time is midnight UTC 2000-01-01.
// Time advances when every goroutine is idle.
// If every goroutine is idle and there are no timers scheduled,
// Run panics.
func Run(f func())

// Wait blocks until every goroutine within the current group is idle.
//
// A goroutine is idle if it is blocked on a channel operation,
// mutex operation,
// time.Sleep,
// a select with no cases,
// or is the goroutine calling Wait.
//
// A goroutine blocked on an I/O operation, such as a read from a network connection,
// is not idle. Tests which operate on a net.Conn or similar type should use an
// in-memory implementation rather than a real network connection.
//
// The caller of Wait must be in a goroutine created by Run,
// or a goroutine transitively started by Run.
// If it is not, Wait panics.
func Wait()

Damien suggested adding also:

// Sleep pauses the current goroutine for the duration d,
// and then blocks until every goroutine in the current group is idle.
// It is identical to calling time.Sleep(d) followed by Wait.
//
// The caller of Sleep must be in a goroutine created by Run,
// or a goroutine transitively started by Run.
// If it is not, Sleep panics.
func Sleep(d time.Duration) {
	time.Sleep(d)
	Wait()
}

The difference between time.Sleep and synctest.Sleep seems subtle enough that it seems like you should have to spell out the Wait at the call sites where you need it. The only time you really need Wait is if you know someone else is waking up at that very moment. But then if they've both done the Sleep+Wait form then you still have a problem. You really only want some of the call sites (maybe just one) to use the Sleep+Wait form.

I suppose that the production code will use time.Sleep since it's not importing synctest, so maybe it's clear that the test harness is the only one that will call Sleep+Wait. On the other hand, fixing a test failure by changing s/time.Sleep/synctest.Sleep/ will be a strange-looking bug fix. Better to have to add synctest.Wait instead.

If we really need this, it could be synctest.SleepAndWait but that's what statements are for. Probably too subtle and should just limit the proposal to Run and Wait.

@gh2o
Copy link

gh2o commented Jun 5, 2024

Some additional suggestions for the description of the Wait() function:

// A goroutine is idle if it is blocked on a channel operation,
// mutex operation (...),
// time.Sleep,
// a select operation with or without cases,
// or is the goroutine calling Wait.
//
// A goroutine blocked on an I/O operation, such as a read from a network connection,
// is not idle. Tests which operate on a net.Conn or similar type should use an
// in-memory implementation rather than a real network connection.
//
// A goroutine blocked on a direct syscall (via the syscall package) is also not idle,
// even if the syscall itself sleeps.

Additionally, for "mutex operation", let's list out the exact operations considered, for implementation/testing completeness:

  • sync.Cond.Wait()
  • sync.Mutex.Lock()
  • sync.RWMutex.Lock()
  • sync.RWMutex.RLock()
  • sync.WaitGroup.Wait()

@nightlyone
Copy link
Contributor

The API looks simple and that is excellent.

What I am worried about is unexpected failure modes leading to undetected regressions, which might need tight support in the testing package to detect.

Imagine you unit test your code but are unable to mock out a dependency, perhaps due to lack of experience or the poor design of existing code you have to work with.

Now that dependency suddenly starts calling a syscall (e.g. to lazily tune the library via a sync.Once with a timeout, instead of at init time).

Without support in the testing package you will never detect that; your tests will simply start timing out after an innocent minor dependency update.

@nightlyone
Copy link
Contributor

May I, orthogonally to the previous comment, suggest limiting this package to the standard library at first, to gather more experience with the approach?

That would also allow sketching out integration with the testing package, in addition to finding more pitfalls.

@neild
Copy link
Contributor Author

neild commented Jun 6, 2024

What I am worried about is the unexpected failure modes, leading to undetected regressions, which might need tight support in the testing package to detect.

Can you expand more on what you mean by undetected regressions?

If the code under test (either directly, or through a dependency) unexpectedly calls a blocking syscall, Wait will wait for that syscall to complete before proceeding. If the syscall completes normally (the code is using os/exec to execute a subprocess, for example), then everything should operate as expected--the operation completes and the test proceeds. If the syscall is waiting on some event (reading from a network socket, perhaps), then the test will hang, which is a detectable event. You can look at goroutine stacks from the timed-out test to analyze the reason for the hang.

Without support in testing

What kind of support are you thinking of?

@ChrisHines
Copy link
Contributor

What does this do?

func TestWait(t *testing.T) {
    synctest.Run(func() {
        synctest.Wait()
    })
}

Does it succeed or panic? It's not clear to me from the API docs because:

If every goroutine is idle and there are no timers scheduled, Run panics.

A goroutine is idle if it [...] is the goroutine calling Wait.

This is obviously a degenerate case, but I think it also applies if a test wanted to get the fake time features when testing otherwise non-concurrent code.

@gh2o
Copy link

gh2o commented Jun 6, 2024

What does this do?

func TestWait(t *testing.T) {
    synctest.Run(func() {
        synctest.Wait()
    })
}

In this case, the goroutine calling synctest.Wait() should never enter idle because there's nothing to wait for, and hence a panic should not occur.

gopherbot pushed a commit that referenced this issue Feb 10, 2025
synctest.Run waits for all bubbled goroutines to exit before returning.
Establish a happens-before relationship between the bubbled goroutines
exiting and Run returning.

For #67434

Change-Id: Ibda7ec2075ae50838c0851e60dc5b3c6f3ca70fb
Reviewed-on: https://go-review.googlesource.com/c/go/+/647755
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Damien Neil <dneil@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
@ianlancetaylor ianlancetaylor marked this as a duplicate of #8869 Feb 10, 2025
@pierrre

This comment has been minimized.

@rittneje
Copy link
Contributor

@pierrre That's a known issue - see #71488.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/651555 mentions this issue: runtime: print stack traces for bubbled goroutines on synctest deadlock

@marwan-at-work
Copy link
Contributor

marwan-at-work commented Mar 3, 2025

👋🏼 Hi there, not sure if this was mentioned, but a time.NewTicker that never has Stop called on it does not end up panicking. The test instead just hangs forever. Here's a repro:

func TestTicker(t *testing.T) {
	synctest.Run(func() {
		time.NewTicker(time.Second)
	})
}

The above test hangs forever. If I instead call Stop, then it will pass correctly.

I expect the above to panic, similar to when a goroutine is left hanging after the func ends, like so:

func TestDanglingGoroutine(t *testing.T) {
	synctest.Run(func() {
		go func() {
			select {}
		}()
	})
}

More general feedback:

  1. Overall, I really like how my concurrent tests are looking code-wise and the fact that tests run fast due to the fake clock is even better.
  2. Onboarding was a little difficult because even though the synctest.Run(func () { ... }) API is simple, there are a lot of implicit rules such as: all goroutines must exit by the end of the func. That said, the rule itself was great because it forced me to write more correct code that handled cleaning up goroutines and also gave me confidence that I'm not leaking any goroutines either.

@neild
Copy link
Contributor Author

neild commented Mar 3, 2025

Thanks for the report!

time.NewTicker that doesn't call Stop does not end up panicking. The test instead just hangs forever.

This is on my known issues list. (#67434 (comment))

I think that the fix here is going to be to stop advancing time once the goroutine started by Run exits. Any leftover Tickers or Timers will not fire. If there are any goroutines blocked reading from a ticker, Run will panic with a deadlock error.

there are a lot of implicit rules such as: all goroutines must exit by the end of the of func. That said, the rule itself was great because it forced me to write more correct code that handled cleaning up goroutines and also gave me confidence that I'm not leaking any goroutines either.

Thanks especially for the experience report on this behavior. It's a bit unusual to enforce goroutine cleanup (there's a lot of code out there that leaks goroutines, especially in tests), so I'm glad to hear that it seems to have worked out as intended for you.

@dsnet
Copy link
Member

dsnet commented Mar 8, 2025

Thank you @neild for "synctest". I was recently working on a task queue execution engine, and this would have been nearly impossible to thoroughly test without either lots of hooks, tests being slow, and/or being flaky. With "synctest", the tasks execute very quickly and I can easily craft the exact conditions for potential races and verifying that our locking patterns properly handle them. I am reasonably confident in the correctness of my code because of this package.

Some thoughts:

  • I found the default starting time of 2000-01-01 unsuitable for our use-cases. We have a database that must be initialized outside of the time bubble, and so it has rows in it dated from the current wall clock. In order to maintain monotonic progression of time when transitioning from the wall clock to the bubble clock, I was always doing this:

    now := time.Now()
    synctest.Run(func() {
    	time.Sleep(now.UTC().Sub(time.Now().UTC())) // align bubble to real time
    	...
    })

    I wonder if the starting time should be an explicit argument to synctest.Run. I can imagine scenarios where you really do want a completely isolated bubble of time with no correlation to the wall clock. Other times (like in my case), you need to connect the wall clock and bubble clock in some way.

  • I believe that calling time.Now inside the bubble should not record a monotonic reading. Consider the above example, if you instead do:

    now := time.Now()
    synctest.Run(func() {
    	time.Sleep(now.Sub(time.Now())) // overflows in arithmetic
    	...
    })

    this will not work correctly since calling time.Now in both cases will record a monotonic reading. Consequently, the time.Time.Sub method will try to use the monotonic readings to perform arithmetic, but be completely off since the monotonic readings are from completely different clock contexts. I argue that time.Now within a bubble should not record a monotonic reading: time in a bubble can only ever progress forwards, so there is no need for one. Then arithmetic between time.Time values that both originated in the bubble works correctly, as does arithmetic between time.Time values from inside and outside the bubble.

  • I sometimes found myself questioning whether "synctest" was working correctly (turned out to actual bugs in my code), but it would be interesting if there could be a mode where "synctest" runs with a real wall clock, instead of a synthetic clock. Obviously tests run slower, but provide the guarantee that "synctest" isn't masquerading some real problem.

@neild
Copy link
Contributor Author

neild commented Mar 8, 2025

I found the default starting time of 2000-01-01 unsuitable for our use-cases.

In the case where you want to start the synthetic time at some other point, I think time.Sleep(when.Sub(time.Now())) is not much worse than any other option. Most tests don't care about the specific time, so I think it's okay if we put a tiny bit more work on the ones that do rather than requiring everyone to specify a starting time.

2000-01-01 has a few advantages:

  • It's very obviously not the current time, so if you leave the bubble clock alone it's easy to distinguish real and fake times.
  • It's consistent. We could start the bubble time at the current time, but then each test run would be different.
  • It's in the past, so if you want to test behavior at some specific point in time you can time.Sleep until that time. For example, if you want to test your code's behavior at a leap year transition, you can pick any leap year after 2000 without worrying that your test will stop working at some point in the future.

I believe that calling time.Now inside the bubble should not record a monotonic reading.

Oops, that's a bug. I thought we weren't recording a monotonic reading. I'll fix that.

but it would be interesting if there could be a mode where "synctest" runs with a real wall clock, instead of a synthetic clock

That's an interesting thought. I think it would need to be a GODEBUG setting or similar; we don't want to make it easy to use synctest.Wait in non-test code. (Perhaps there's a good use case for a Wait-style operation in production code, but if so we should design that as a separate feature outside of synctest.)

Would the bubble clock still start at a synthetic base time? I can see some tricky concerns either way.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/657415 mentions this issue: runtime, time: don't use monotonic clock inside synctest bubbles

@mvdan
Copy link
Member

mvdan commented Mar 14, 2025

What is the general guidance for integration tests of HTTP servers which currently use https://pkg.go.dev/net/http/httptest? #71345 correctly points out that the docs say:

A goroutine executing a system call or waiting for an external event such as a network operation is not durably blocked. For example, a goroutine blocked reading from an network connection is not durably blocked even if no data is currently available on the connection, because it may be unblocked by data written from outside the bubble or may be in the process of receiving data from a kernel network buffer.

While this is true in the general sense, in practice many tests are currently written with httptest in such a way that only the same Go test and its goroutines interact with the server via HTTP clients. I have some of those tests which currently use a fake clock to mimic e.g. cookie or access token expiry, and it's not clear to me how I should attempt to use synctest in these tests.


@seankhliao
Copy link
Member

I think for httptest we'd need #14200
net/http seems to already have an internal implementation at https://go.googlesource.com/go/+/bceade5ef8ab6d28ad363cd7ca60a9be89990a00/src/net/http/netconn_test.go#5

@neild
Copy link
Contributor Author

neild commented Mar 17, 2025

You can use synctest and httptest together, but it's a bit tricky at the moment. You need to provide a fake network, which you can inject by setting httptest.Server.Listener and http.Transport.DialContext to appropriate fake implementations.

A very barebones example:
https://go.dev/play/p/HJQKRKuEYQf

The net/http package contains a more full-featured approach in its internal tests:
https://go.googlesource.com/go/+/refs/tags/go1.24.0/src/net/http/clientserver_test.go#190

If synctest is promoted out of experimental status, I think we'll want to consider whether we should provide an easy-to-use fake network for httptest (#14200).

gopherbot pushed a commit that referenced this issue Mar 18, 2025
Don't include a monotonic time in time.Times created inside
a bubble, to avoid the confusion of different Times using
different monotonic clock epochs.

For #67434

goos: darwin
goarch: arm64
pkg: time
cpu: Apple M1 Pro
         │ /tmp/bench.0 │            /tmp/bench.1            │
         │    sec/op    │   sec/op     vs base               │
Since-10    18.42n ± 2%   18.68n ± 1%       ~ (p=0.101 n=10)
Until-10    18.28n ± 2%   18.46n ± 2%  +0.98% (p=0.009 n=10)
geomean     18.35n        18.57n       +1.20%

Change-Id: Iaf1b80d0a4df52139c5b80d4bde4410ef8a49f2f
Reviewed-on: https://go-review.googlesource.com/c/go/+/657415
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Damien Neil <dneil@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
@SpikesDivZero
Copy link

I've been using synctest and am pretty happy, so you all have my gratitude for this experiment.

One rough edge I've felt concerns its potential interaction with testing.T.Context, also introduced in go1.24. Should users regard it as safe to use t.Context().Done() inside a synctest bubble?

This is reasonably safe as of the current version but may not be safe in future versions.

If we want the T Context to be safe to use inside a bubble, then there's a case to be made for an added test. I took a look around and didn't see a test covering this.

This may also add a point in favor of the earlier T.RunIsolated (or similar) discussion above: preventing a user from accessing the Done channel before entering the synctest bubble.

// Trivialized example of how a user may use testing.T.Context.Done

func codeUnderTest(ctx context.Context, timeout time.Duration) {
	select {
	case <-ctx.Done():
	case <-time.After(timeout):
	}
}

func TestTContextDone(t *testing.T) {
	synctest.Run(func() {
		const timeout = time.Hour // as if from a table test

		t0 := time.Now()
		codeUnderTest(t.Context(), timeout)
		if d := time.Since(t0); d != timeout {
			t.Errorf("duration is %v, wanted %v", d, timeout)
		}
	})
}

@neild
Copy link
Contributor Author

neild commented Mar 24, 2025

I've been thinking on the interactions between the synctest and testing packages. T.Context is a problem, as is T.Cleanup.

I've got a few ideas so far that seem workable, with various tradeoffs.

The simplest I can think of is to add synctest.Start (and probably remove synctest.Run):

package synctest

// Start places the current goroutine in a new synctest bubble.
// The goroutine must not be in a bubble already.
func Start()

We could then make the testing package aware of bubbles: T.Context would return a different (bubbled) context, T.Cleanup would register functions that run inside the bubble, and the testing package would wait for bubbled goroutines to exit before finishing a test.

This has the advantage of simplicity, but the disadvantage of losing the clean delineation of the bubble lifetime that synctest.Run provides.

Another option would be to provide a function that starts a bubble with a testing.T, B, or F that is bubble-aware. This could be in the synctest package:

package synctest

// Test runs f in a bubble in a new goroutine.
func Test[T testing.TB](t T, f func(T))

Or in the testing package:

package testing

// RunIsolated runs f as a subtest in a new bubble.
func (t *T) RunIsolated(name string, f func(*T))

@thatnealpatel
Copy link
Member

TL;DR

I am exceedingly happy with synctest.

My use-case is a single-binary that implements an execution platform.
The binary is instrumented using concurrency primitives to handle tight-looped, time-based events alongside event-based triggers.
In addition, it uses REST to communicate with a broker and SQLite for state management (via sqlc code generation).

My main (self-inflicted) pain points were:

  1. New solution for database durability between epochs when using synctest
  2. Re-instrumenting HTTP integrations to keep as much production code in the system under test (SUT) as possible

The only downside was giving up remote logging (via cloud.google.com/go/logging) in my integration tests. However, given the sheer utility of synctest, this was an extremely acceptable trade-off. Ironically, a majority of the value to me for having these remote logs was to aid in debugging test flakes due to injected clocks.

Despite those minor points, in <6 hours, I was able to:

  1. Fully convert to use synctest idioms
  2. Produce a net negative code diff
  3. Reduce practical* testing execution time from 100s -> 8s**
  4. Eliminate all non-deterministic behavior and test flakes
  5. Increase expressiveness of integration test cases through generalization around durable test time
  6. Catch subtle concurrency bugs (e.g. leaked testing goroutines)

It would be an understatement to say that I would be sad if synctest did not make it into go1.25.


*as further noted below, I was virtually forced to use -race to get my tests to mostly pass with injected clocks. With synctest, this is no longer true; therefore, this is a material performance gain to me. The "real diff" is approximately 60% faster integration test execution (18s->8s).

**run on M3 Macbook Air

Details

General Architecture

The following is an illustrative subset of the use-case I have; most notably it includes:

  • frequent network I/O via HTTP
  • frequent disk I/O via database interactions in SQLite
  • highly concurrent execution using mixed time-based / event-based channels

Prior to synctest, the below would use https://github.com/jmhodges/clock.Clock passed via injected Params for each of the components.

package broker

// Broker dispatches HTTP requests to the broker via REST.
func Broker(ctx context.Context, ...) (chan<- Request, chan<- Refresh) {

  reqs := make(chan Request)
  refresh := make(chan Refresh)
  db, cleanup, _ := dataplane.OpenRWConnection(dataplane.Broker, ...)
  clientCache := make(map[dataplane.ExecutorT]*http.Client) // oauth2 details abstracted away here
  logclient, flush, _ := monitoring.NewCloudLogger(...)

  go func() {
    for {
      select {
      case <-ctx.Done():
        flush()
        cleanup()
        return
      case <-time.After(time.Minute): // lightweight thing
      case r := <-reqs:               // 5 QPS, network I/O thing, uses `clientCache`
      case r := <-refresh:            // infrequent, network I/O thing, mutates `clientCache`
      }
    }
  }()
  return reqs, refresh
}
package cache

// CacheLayer provides synchronized access to post-processed broker data.
func CacheLayer(ctx context.Context, dispatch chan<- broker.Request, ...) chan<- Request {

  reqs := make(chan Request)
  logclient, flush, _ := monitoring.NewCloudLogger(...)
  cache := make(map[any]any) // abstracting details away

  go func() {
    for {
      select {
      case <-ctx.Done():
        flush()
        return
      case <-time.After(time.Second): // heavy calculation (1000s of PDEs), network I/O via `dispatch`
      case r := <-reqs:               // 50 QPS, uses post-processed cache
      }
    }
  }()
  return reqs
}
package executor 

// Executor performs core business-logic using the data via the broker.
func Executor(ctx context.Context, dispatch chan<- broker.Request, oracle chan<- cache.Request, ...) {

  db, cleanup, _ := dataplane.OpenRWConnection(dataplane.Executor, ...)
  logclient, flush, _ := monitoring.NewCloudLogger(...)
  trig := make(chan struct{}, 1)

  go func() {
    for {
      select {
      case <-ctx.Done():
        flush()
        cleanup()
        return
      case <-time.After(time.Minute):   // uses `db`, `dispatch`, `oracle` [cheap, non-critical]
      case <-time.After(4*time.Second): // uses `db`, `dispatch`, `oracle` [expensive, critical]
      case <-trig:                      // uses `db`, `dispatch`, `oracle` [cheap, critical]
      }
    }
  }()
}

Testing Strategy

Previous to synctest, I had to use injected clocks (clock.FakeClock); I used https://github.com/jmhodges/clock which was perfectly fine for a simple API.

My testing platform implements 2 fake components:

  1. simulated broker platform (to be used with httptest) that allows for dynamic injection of server-side logic to test my production client-side code in the SUT
  2. simulated, time-dependent data platform (used in the simulated broker platform)

These components are implemented in a similar manner to the above; there are a mix of time-based and event-based triggers with similar costs and criticality. I chose this strategy due to the high coverage of production code in a SUT. Additionally, it covers more components than those listed here for free.

Some details are omitted below; however, it is a faithful representation of the use-case.

package executor 

func setupExecutorIntegrationTest(ctx context.Context) (chan<- sim.Request, func(), *sql.DB) { // with sqlc, you can just return `*<db-pkg>.Queries` in place of `*sql.DB`
  brokersim := sim.BrokerPlatform(ctx, ...)
  httpclient, netclean, _ := newTestHttpServer(ctx, brokersim) // large-ish implementation of switch { case ... } on real API paths

  dispatch, refresh := broker.Broker(ctx, ...) // via abstraction, httpclient.Client() is passed here
  oracle := cache.CacheLayer(ctx, dispatch, ...)
  Executor(ctx, dispatch, oracle, ...) // in-memory db via abstraction

  db, dbclean, _ := dataplane.OpenReadOnlyConnection(dataplane.Executor, ...) // in-memory db

  return brokersim, func() { netclean(); dbclean() }, db
}

Downsides

There are numerous, well-known downsides to this sort of testing instrumentation; in my case, the most relevant:

  1. speed of the integration test
  2. ease of parallelization of the integration tests
  3. flakiness of the integrations tests

For example, I virtually was forced to run all my tests with -race to eliminate most (but not all) flakiness associated with spinning the clock and time-dependent calculations:

func TestIntegration_SomeCase(...) {
  t.Parallel()
  ctx, cancel := context.WithTimeout(context.Background(), defaultTimeout)

  brokersim, clk, cleanup, db := setupExecutorIntegrationTest(ctx)
  t.Cleanup(cleanup)
  clk.Set(...)

  simdata(brokersim, ...) // do some set up
  for range 120 { // invariant is average flaky
    clk.Add(time.Second)
  }
  assert(t, db, ...) // invariant for event-based / time-based behavior

  simdata(brokersim, ...) // do some set up
  for range 50 { // invariant is less flaky
    clk.Add(time.Second)
  }
  assert(t, db, ...) // invariant for event-based / time-based behavior

  simdata(brokersim, ...) // do some set up
  for range 250 { // invariant is more flaky
    clk.Add(time.Second)
  }
  assert(t, db, ...) // invariant for event-based / time-based behavior
}

synctest to the rescue

Before synctest, I had no concrete way to write more expressive, generalized integration test cases; however, the determinism allowed me to do so.

The learning curve was not as steep as I expected; however, I spent most of the time tracking down leaked goroutines.

key challenge: giving up cloud.google.com/go/logging in SUT

Due to how fast synctest is, the asynchronous logging buffers fill up too quickly for the library to flush; though I did not put much effort into debugging that (<60 seconds), it appears that the library makes some call into the bubble causing a panic. This behavior does not happen when using -race which "naturally" slows down execution.

As noted above, I did not spend much effort here because the primary use-case for having remote logging in my integration tests was to debug the very problems synctest solved for me.

My solution was simply to make my monitoring package aware of test.v; instead of bringing up the actual CloudLogging client in my thin wrapper, it simply sets all of the logging interfaces to log.Default() then io.Discards them in tests.

key challenge: http.Client -> http.Transport

The snippet generously provided by @neild at https://go.dev/play/p/HJQKRKuEYQf was virtually a drop-in replacement. I threw it into a ./internal/fakehttp package (for the time being).

I only needed to change the semantics of using *http.Client everywhere (e.g. Broker) to using http.RoundTripper (via *http.Transport). This was material to me because it allowed me to keep production code in the SUT which had some oauth2 components used for token exchange.

Due to my general lack of cleanup knowledge, the order in which these resources needed to get cleaned up did take a few minutes to grok; however, once I found the correct permutation, it all made sense.

key challenge: SQLite

Prior, all the databases used in the SUT were in-memory (e.g. file:mydb?mode=memory).

Due to my specific use-case, there are many events that fire (as seen in the general architecture section) in a given day; in addition, there are many time+event based triggers that happen 6+ hours apart that are critical invariants to test.

Obviously, synctest majorly speeds up testing vs. real wall-time; however, with a "congested" SUT, it still takes a material amount of time to finish. With injected clocks, it was easy to just make a clk.Set(...) call.

In order to preserve this behavior of "run something for 17 seconds now, then in 6 hours check this invariant that affects that thing," my solution was to serialize the states of the SUT (both the SQLite portion and the sim.BrokerPlatform state) as .testdb and .json flat-files, respectively. For non-SQLite implementations, I imagine a similar approach would be taken to avoid making uninteresting network calls.

In effect, every test case uses ephemeral ./testdata/x.testdb and ./testdata/x.json that it is responsible for creating and cleaning up. This allowed me to use the following paradigm to greatly speed up invariant testing without loss of durability:

type epoch struct {
  start      time.Time
  timeout    time.Duration
  sleep      time.Duration
  sim        []sim.Request
  invariants map[invT]any // abstracting details away here
}

func runEpochs(t *testing.T, configs []config, epochs []epoch) {
  t.Helper()
  for _, e := range epochs {
    time.Sleep(time.Until(e.start))
  
    ctx, cancel := context.WithTimeout(context.Background(), e.timeout)
    defer cancel()
    timeout := false
    context.AfterFunc(ctx, func() { timeout = true })

    brokersim, cleanup, db := setupExecutorIntegrationTest(ctx, t.Name(), configs) // serializes to `.json`, `.testdb`
    defer cleanup()

    simdata(brokersim, e.sim)
    time.Sleep(e.sleep)
    synctest.Wait()

    if timeout {
      t.Fatalf("%s timed out", t.Name())
    }
    assert(t, e.invariants)

    cleanup()
    cancel()
    synctest.Wait()
  }
}

func TestIntegration_SomeCase(t *testing.T) { 
  t.Parallel()
  defer func() { os.Remove(...); os.Remove(...) }()
  configs := []config{{...}}
  epochs := []epoch{{...}, {...}, {...}}
  runEpochs(t, configs, epochs) // wow!
}

While it is true that generalizing some notion of epoch for test cases could have happened before synctest, I had virtually no confidence in doing so because non-durably blocked execution would have yielded a bloated struct that just made the flake-reducing behavior hidden and less readable.

@gh2o

gh2o commented Mar 26, 2025

@neild

I've been thinking on the interactions between the synctest and testing packages. T.Context is a problem, as is T.Cleanup.

To avoid introducing an additional level of nesting or function indirection, what do you think about adding a T.Isolated method that functions similarly to T.Parallel?

// Isolated signals that this test is to be run inside an isolated "bubble".
// The test function will first be run as normal. When this function is called
// for the first time, the test function will exit immediately (via runtime.Goexit()),
// and then the test function will be called again from a bubbled goroutine.
// The second call to this function will be a no-op.
func (t *T) Isolated()

We can then write tests in the following manner:

func TestSleep(t *testing.T) {
  t.Isolated()

  a := time.Now()
  time.Sleep(24 * time.Hour)
  b := time.Now()

  if b.Sub(a) != 24 * time.Hour {
    t.Error("sleep not exact")
  }
}

T.Context and T.Cleanup can now also interact correctly with the bubble, now that the T type itself is bubble-aware.

@neild
Contributor Author

neild commented Mar 26, 2025

T.Isolated is essentially the same as synctest.Start with a different spelling. Both replace the current Run operation of running a function in a bubble with an operation that moves the current goroutine into a bubble.

T.Isolated has the advantage of being scoped to the testing package.

You still need to call synctest.Wait in many circumstances, so the bubble API is now spread across two packages. We could address this by dropping the synctest package entirely and moving Wait into the testing package. An advantage of the current layout is that all the documentation for bubbles naturally exists in testing/synctest. We would lose that, but perhaps the loss is balanced by better discoverability in the testing package.

If we did keep the testing/synctest package, the name T.Isolated doesn't suggest a connection to that package. Perhaps we could rename either T.Isolated or testing/synctest to make the relation clearer. Or if we move synctest.Wait into testing then this problem does not apply.

@apparentlymart

I personally like t.Isolated because it suggests that the scope of the bubble is connected with the test that called it, even if the other functions for manipulating an already-active bubble live elsewhere.

But if significant API changes are still on the table, a further idea that the above discussion inspired was to have t.Isolated return an object of a type that could itself still be defined in the synctest package -- let's say synctest.Bubble for the sake of example -- and have the bubble manipulations be methods of that type rather than package-level functions.

This would mean that the bubble manipulations can be done only by code that has access to that "bubble" object. I could imagine arguing that both as an advantage (it means you don't need to worry about spooky action at a distance with random code elsewhere fiddling with the bubble in ways that are not obvious from the test code) and as a disadvantage (you now have another value in addition to t that needs to be passed around to all of your test helper functions).

@neild
Contributor Author

neild commented Mar 26, 2025

I've created a sub-issue with a proposal for a revised version of the API that replaces Run with Start: #73062

(Creating as a separate proposal to keep the discussion of this specific change organized.)

@MarcoPolo

I've been using synctest lately, and it's pretty nice. I have one small request. Could we add to the comment on Wait to specifically call out that goroutines blocked on mutexes are not durably blocked? I know that an astute reader would notice that mutexes are not in the list of operations that durably block, and thus infer that a call to Mutex.Lock would not durably block a goroutine in a bubble. However, a small change to the comment text would make this clearer.

Consider also that there were a lot of discussions about whether a goroutine blocked on a Mutex should be durably blocked, and the initial proposal text mentioned mutex operations as supported. To quote Russ:

Someone can implement what is semantically a mutex using (a) sync.Mutex, (b) channels, or (c) sync.Cond.
It seems weird that one of these methods has different behavior than the other two.
It seems like all three should be acceptable, not just sync.Mutex.

I agree it's weird and especially non-obvious. I understand the reasons for not supporting sync.Mutex right now. I'm not suggesting a change to support them. I am suggesting an addition to the comment on Wait to explicitly call out mutex operations as non-durably blocking.

@stroiman

stroiman commented Mar 29, 2025

First of all, I haven't used synctest, so my comment is purely based on reading the docs. But I'd like to see more control over the wall clock.

For testing line-of-business applications, setting a specific wall clock is often extremely useful in tests. E.g., setting the time to 23:59:59 on the last day of the month, vs. 00:00:01 the following day, to check whether a rule was applied. This becomes even more critical for multi-time zone systems (i.e., the ability to control the UTC wall clock when testing a multi-time zone system is critical).

According to the docs:

Goroutines in the bubble use a synthetic time implementation. The initial time is midnight UTC 2000-01-01.

This should make it technically possible to forward time to the desired simulated wall clock.

time.Sleep(MustParseISOTime("2025-04-01T12:00-01:00").Sub(time.Now()))

But it's not as explicit as something like SetTime(MustParseISOTime("2025-04-01T12:00-01:00")).

A suggestion could be passing a value to the bubble to interact with time itself?

type T interface {
	SetTime(time.Time)
}
Run(func(T))

@neild
Contributor Author

neild commented Mar 31, 2025

This should make it technically possible to forward time to the desired simulated wall clock.

This is exactly the right way to set a precise simulated time: time.Sleep until the desired point in time.

We could add a SetTime function, but this would be additional API surface with no additional functionality. We'd have to define what happens when time goes backwards. The interactions with other goroutines also seem ambiguous: If one goroutine calls Sleep(1*time.Second) and another calls SetTime(time.Now().Add(2*time.Second)), does the first goroutine wake after one simulated second or two?

Using time.Sleep is (in my opinion) clearer: Time can't go backward, and intervening events occur before the sleeping goroutine wakes.

@gopherbot
Contributor

Change https://go.dev/cl/662155 mentions this issue: runtime, testing/synctest: stop advancing time when main goroutine exits

Projects
Status: Hold