| Commit message | Author | Age | Files | Lines |
| |
|
|
| |
any is now preferred over interface{} in Go.
|
| |
| |
The whole pkg/repro algorithm is very sensitive to random kernel
crashes, yet all other parts of the system rely on pkg/repro reproducers
being reliable enough to draw meaningful conclusions from running them.
A single unrelated kernel crash during repro extraction may divert the
whole process since all the checks we do during the process (e.g. during
minimization or when we drop prog opts) assume that if the kernel didn't
crash, it was due to the fact that the removed part was essential for
reproduction, and not due to the fact that our reproducer is already
broken.
Since such a problem may happen at any moment, let's do a single
validation check at the very end of repro generation. Overall, these
cases are not super frequent, so it's not worth re-checking every
step.
Calculate the reliability score of the reproducer and use a 15% default
cut-off for flaky results.
|
| |
|
|
| |
This will aid in debugging the tests that failed.
|
| |
|
|
|
| |
Move C repro generation from syz-manager to pkg/repro to avoid code
duplication.
|
| |
|
|
|
| |
If an error happened during prog minimization, abort it instead of
trying to proceed further.
|
| |
|
|
|
|
|
|
|
|
| |
Make the pool.Run() function take a context.Context to be able to abort
the callback passed to it or abort its scheduling if it's not yet
running.
Otherwise, if the callback has not yet started and the pool's Loop is
aborted, we risk waiting in pool.Run() forever. That prevents the normal
shutdown of repro.Run() and, consequently, of the DiffFuzzer functionality.
|
| | |
|
| |
|
|
|
|
|
|
|
|
| |
Refactor pkg/repro to accept a context.Context object. This will make it
look more similar to other package interfaces and will eventually let us
abort currently running repro jobs without having to shut down the whole
application.
Simplify the code by factoring out the parameters common to both RunSyzRepro()
and RunCRepro().
|
| |
|
|
|
|
|
|
|
| |
Pools and ReproLoop are always created on start,
so there is no need to support lazy initialization for them.
It only complicates code and makes it harder to reason about.
Also introduce vm.Dispatcher as an alias to dispatcher.Pool,
as it's the only specialization we use in the project.
|
| |
|
|
|
| |
Some of the levels were just too high, especially considering that the
messages are printed via log.Logf().
|
| |
|
|
|
| |
It's to be used in case a quick reproducer is necessary. It omits C
repro generation and a number of option simplifications.
|
| |
| |
Currently we kill hung processes and consider the corresponding test finished.
We don't kill/wait for the actual test subprocess (we don't know its pid to kill,
and waiting would presumably hang). This has 2 problems:
1. If the hung process causes a "task hung" report, we can't reproduce it,
since the test finished too long ago (the manager thinks it's finished and
discards the request).
2. The test process still consumes per-pid resources.
Explicitly detect and handle such cases:
the manager keeps these hung tests forever,
and we assign a new proc id for future processes
(don't reuse the hung one).
|
| |
|
|
|
|
|
|
|
|
| |
Ideally, we should be mindful of that during the whole repro process,
but there's always a chance that different titles are the manifestations
of the same problem.
So let's stay tolerant to different titles during prog extraction and
minimization, but carefully check them during opt simplifications and C
repro extraction.
|
| |
|
|
|
|
|
| |
Our largest timeout is 6 minutes, so anything between 1.5 and 6 minutes
ended up with a 9-minute timeout. That's too much.
Consider the time it actually took to crash the kernel.
|
| |
|
|
|
| |
15 seconds is an unreasonably small timeout. Let's do at least 30
seconds first, then at least 100 seconds.
|
| |
|
|
|
|
| |
1) If we know the tentative reproducer, try it alone before the
bisection. It's the best single candidate program.
2) During bisection, never drop the program.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
Callers shouldn't control lots of internal details of minimization
(if we had more params, that would just be more variations to test,
and since we don't, params is just a more convoluted way to say
whether we minimize for the corpus or for a crash).
2 bools also allow expressing 4 options, but only 3 make sense.
Also, when I see MinimizeParams{} in the code, it's unclear what it means.
Replace params with mode.
And potentially "crash" minimization is not "light", it's just different.
E.g. we can simplify int arguments for reproducers (esp. in snapshot mode),
but we don't need that for the corpus.
|
| | |
|
| |
|
|
|
| |
It seems that this error may come up in absolutely valid and reasonable
cases. Restore the special casing.
|
| |
|
|
|
| |
Now that we do not take the programs from the SSH-based logs, the error
does look surprising, so let's print it with log.Errorf().
|
| |
|
|
|
|
| |
Minimizing to 0 calls leads to an empty execution log, which leads to an
immediate exit of tools/syz-execprog, which would be recognized as "lost
connection to machine".
|
| |
|
|
|
| |
Rely on instance.Pool to perform fuzzing and do bug reproductions.
Extract the reproduction queue logic to separate testable class.
|
| |
|
|
|
|
|
|
|
|
| |
When we combine the progs found during prog bisection, there's a chance
that we may exceed the prog.MaxCalls limit. In that case, we get a
SYZFATAL error and proceed as if it were the target crash. That's
absolutely wrong.
Let's first minimize each single program before concatenating them;
that should work for almost all cases.
|
| |
|
|
| |
Add an explicit parameter to only run call removal.
|
| |
|
|
|
|
|
|
|
|
|
| |
When we accept new kernels for fuzzing, we need more extensive testing,
but syz-ci switched to using syz-manager for this purpose.
Now instance testing is used only for bisection and patch testing,
which does not need such extensive image testing (it may even be harmful).
So just run a simple program as a test.
It also uses the same features as the target reproducer,
so e.g. if the reproducer does not use wifi, we won't test it,
which reduces the chances of hitting unrelated kernel bugs.
|
| |
|
|
|
| |
If C reproducers keep on printing "executing program" lines, it will be
easier to re-use them during the repro and patch testing.
|
| |
|
|
|
|
|
| |
Start switching from host.Features to flatrpc.Features.
This change is supposed to be a no-op,
just to reduce future diffs that will change
how we obtain features.
|
| |
|
|
|
|
| |
In many cases bisection does not seem to bring any results, but it takes
quite a while to run. Let's save some time by running the whole log
before the process.
|
| |
|
|
| |
It feels like 8 is a bit too high a number. Let's stop reproduction earlier.
|
| | |
|
| |
|
|
|
|
|
|
| |
If an instance fails to boot 3 times in a row,
we drop it on the floor. If we drop all instances this way,
the repro process deadlocks indefinitely.
Retry indefinitely instead. There is nothing else good we can do
in this case. If the instances become alive again, repro will resume.
|
| |
|
|
|
| |
Now it looks like a failure of the whole reproduction process.
Adjust the message to reduce confusion.
|
| |
| |
Retrying once has greatly reduced the number of "failed to copy prog to
VM" errors, but they still periodically pop up. The underlying problem
is still not 100% known.
Supposedly, if a booted VM with an instrumented kernel has to wait too long,
it can just hang or crash by itself. At least on some problematic
revisions.
Investigation would be quite time-consuming -- we need to do a
complicated refactoring in order to also capture serial output for
Copy() failures. So far it does not seem to be totally worth it.
Let's do 3 runOnInstance() attempts. If the problem still persists,
there's no point in doing more runs -- we'd have to determine the
exact root cause.
|
| |
|
|
|
|
|
|
|
|
|
| |
In the current code, there's a possibility that we write to ctx.bootRequests
after it was quickly closed. That could happen when we immediately abort
the reproduction process after it's started.
To avoid this, don't send elements over the bootRequests channel in the
createInstances() function.
Hopefully closes #4016.
|
| | |
|
| |
|
|
|
| |
This will help avoid a circular dependency pkg/vcs -> pkg/report ->
pkg/vcs.
|
| |
|
|
| |
Amend oops and oopsFormat to contain report type.
|
| |
|
|
|
| |
If the feature is supported on the device, allocate a 128MB swap file
after VM boot and activate it.
|
| |
|
|
|
| |
Otherwise we're getting "repro failed: all VMs failed to boot" pkg/repro
errors if a syzkaller instance is shutting down.
|
| |
|
|
|
|
|
| |
Most of those errors seem to be transient, so there's no sense in failing
the whole C repro generation process.
Give it one more chance and only fail after that.
|
| |
|
|
|
|
|
| |
Only use ctx.bootRequests to indicate that no further VMs are needed.
Do not return from Run() until we have fully stopped the VM creation
loop as there's a risk it might interfere with fuzzing.
|
| |
|
|
|
| |
This is a sanity test for the overall pkg/repro machinery. It does not
focus on minor corner cases.
|
| |
|
|
|
| |
Interact with a syz-execprog instance via an additional interface. This
will simplify testing.
|
| |
|
|
|
|
| |
Split Run() into several functions to facilitate testing.
This commit does not introduce any functional changes.
|
| |
|
|
|
| |
It's not entirely normal, but it can still happen and it's not a big
problem by itself. Let's not pollute our error logs.
|
| |
| |
Add support for moving a NIC PCI pass-through VF into Syzkaller's network
namespace so that it will be tested. As DEVLINK support is triggered by
setting the pass-through device to "addr=0x10", NIC PCI pass-through VF
support will be triggered by setting the device to "addr=0x11".
If a NIC PCI pass-through VF is detected in do_sandbox, set up a staging
namespace before the fork() and transfer the NIC VF interface to it.
After the fork(), in the child, transfer the NIC VF interface to
Syzkaller's network namespace and rename the interface to netpci0 so
that it will be tested.
Signed-off-by: George Kennedy <george.kennedy@oracle.com>
|
| |
|
|
|
|
|
|
|
|
| |
Previously it was copy-pasted in pkg/instance, pkg/repro, and
tools/syz-crash. Use a single implementation instead.
Also, this commit fixes a bug: the previous code always set collide to
true while reproducing a bug, which led to an immediate syz-execprog
exit. As a result, newer bugs with only a .syz repro were never actually
reproduced on #syz test requests.
|
| |
|
|
|
|
|
| |
We have "suppressions" parameter to suppress non-interesting reports.
Add an "interests" parameter, which is the opposite of "suppressions" --
everything that's not in "interests" is suppressed.
It's matched against bug title, guilty file and maintainer emails.
|
| |
| |
Replace the currently existing straightforward approach to race triggering
(that was almost entirely implemented inside syz-executor) with a more
flexible one.
The `async` call property instructs syz-executor not to block until the
call has completed execution and proceed immediately to the next call.
The decision on what calls to mark with `async` is made by syz-fuzzer.
Ultimately this should let us implement more intelligent race provoking
strategies as well as make more fine-grained reproducers.
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Now that the call properties mechanism is implemented, we can refactor
fault injection.
Unfortunately, it is impossible to remove all traces of the previous approach.
In reprolist and while performing syz-ci jobs, syzkaller still needs to
parse the old format.
Remove the old prog options-based approach whenever possible and replace
it with the use of call properties.
|