path: root/pkg/repro
* all: use any instead of interface{} (Dmitry Vyukov, 2025-12-22, 1 file, -4/+4)
  any is now the preferred spelling of interface{} in Go.
* pkg/repro: validate the result (Aleksandr Nogikh, 2025-06-26, 2 files, -11/+139)
  The whole pkg/repro algorithm is very sensitive to random kernel crashes,
  yet all other parts of the system rely on pkg/repro reproducers being
  reliable enough to draw meaningful conclusions from running them. A single
  unrelated kernel crash during repro extraction may divert the whole
  process, since all the checks we do along the way (e.g. during
  minimization or when we drop prog opts) assume that if the kernel didn't
  crash, it was because the removed part was essential for reproduction, not
  because our reproducer was already broken.
  Since such a problem may happen at any moment, let's do a single
  validation check at the very end of repro generation. Overall, these cases
  are not very frequent, so it's not worth re-checking every step. Calculate
  the reliability score of the reproducer and use a 15% default cut-off for
  flaky results.
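  The final validation step described above can be sketched roughly as
  follows. The function and constant names are illustrative, not the actual
  pkg/repro API; only the 15% cut-off comes from the commit message.

  ```go
  package main

  import "fmt"

  // flakyCutoff is the default reliability threshold from the commit
  // message: reproducers that crash in fewer than 15% of replay runs are
  // treated as flaky.
  const flakyCutoff = 0.15

  // reliabilityScore is the fraction of replay runs that reproduced the crash.
  func reliabilityScore(crashes, runs int) float64 {
  	if runs == 0 {
  		return 0
  	}
  	return float64(crashes) / float64(runs)
  }

  // isReliable applies the cut-off to decide whether to keep the reproducer.
  func isReliable(crashes, runs int) bool {
  	return reliabilityScore(crashes, runs) >= flakyCutoff
  }

  func main() {
  	fmt.Println(isReliable(1, 10)) // 10% < 15%: treated as flaky
  	fmt.Println(isReliable(3, 10)) // 30%: accepted
  }
  ```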
* pkg/repro: add logging to repro tests (Aleksandr Nogikh, 2025-06-26, 2 files, -9/+17)
  This will aid in debugging the tests that failed.
* syz-cluster: report reproducers for findings (Aleksandr Nogikh, 2025-06-23, 1 file, -0/+12)
  Move C repro generation from syz-manager to pkg/repro to avoid code
  duplication.
* pkg/repro: abort prog minimization on error (Aleksandr Nogikh, 2025-04-24, 1 file, -1/+16)
  If an error happens during prog minimization, abort it instead of trying
  to proceed further.
* vm/dispatcher: make pool.Run cancellable (Aleksandr Nogikh, 2025-04-23, 2 files, -3/+8)
  Make the pool.Run() function take a context.Context so that the callback
  passed to it can be aborted, or its scheduling aborted if it's not yet
  running. Otherwise, if the callback has not yet started and the pool's
  Loop is aborted, we risk waiting in pool.Run() forever, which prevents
  the normal shutdown of repro.Run() and, consequently, of the DiffFuzzer
  functionality.
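  The shape of the fix can be sketched with a toy pool; the real
  vm/dispatcher API differs, but the point is that Run selects on
  ctx.Done() both while waiting to be scheduled and while waiting for the
  callback to finish, so a dead pool loop can no longer block it forever.

  ```go
  package main

  import (
  	"context"
  	"errors"
  	"fmt"
  )

  // pool is a toy stand-in for a dispatcher pool: jobs are consumed by a
  // separate loop (not shown), which may be shut down at any time.
  type pool struct {
  	jobs chan func()
  }

  var errCanceled = errors.New("pool: run canceled")

  // Run schedules job and waits for it, but gives up as soon as ctx is done.
  func (p *pool) Run(ctx context.Context, job func()) error {
  	done := make(chan struct{})
  	wrapped := func() { job(); close(done) }
  	select {
  	case p.jobs <- wrapped: // scheduled; now wait for completion or cancellation
  	case <-ctx.Done():
  		return errCanceled
  	}
  	select {
  	case <-done:
  		return nil
  	case <-ctx.Done():
  		return errCanceled
  	}
  }

  func main() {
  	p := &pool{jobs: make(chan func())} // nothing consumes jobs: loop is "dead"
  	ctx, cancel := context.WithCancel(context.Background())
  	cancel() // simulate the pool loop being aborted before the job starts
  	fmt.Println(p.Run(ctx, func() {}))
  }
  ```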
* tools/syz-execprog: support running unsafe programs (Dmitry Vyukov, 2024-11-26, 1 file, -1/+1)
* pkg/repro: accept a cancellable context (Aleksandr Nogikh, 2024-11-13, 3 files, -83/+80)
  Refactor pkg/repro to accept a context.Context object. This makes it
  look more similar to other package interfaces and will eventually let us
  abort currently running repro jobs without having to shut down the whole
  application. Simplify the code by factoring out the parameters common to
  both RunSyzRepro() and RunCRepro().
* pkg/manager: set more http fields before calling Serve (Dmitry Vyukov, 2024-11-07, 2 files, -4/+3)
  Pools and ReproLoop are always created on start, so there is no need to
  support lazy setting for them; it only complicates the code and makes it
  harder to reason about. Also introduce vm.Dispatcher as an alias for
  dispatcher.Pool, as it's the only specialization we use in the project.
* pkg/repro: adjust log levels (Aleksandr Nogikh, 2024-10-25, 1 file, -4/+4)
  Some of the levels were just too high, especially considering that the
  messages are printed via log.Logf().
* pkg/repro: add a fast mode (Aleksandr Nogikh, 2024-10-25, 1 file, -23/+51)
  To be used when a quick reproducer is necessary: it omits C repro
  generation and a number of option simplifications.
* executor: better handling of hung test processes (Dmitry Vyukov, 2024-10-24, 1 file, -0/+2)
  Currently we kill hung processes and consider the corresponding test
  finished. We don't kill/wait for the actual test subprocess (we don't
  know its pid to kill, and waiting would presumably hang). This has two
  problems:
  1. If the hung process causes a "task hung" report, we can't reproduce
     it, since the test finished too long ago (the manager thinks it's
     finished and discards the request).
  2. The test process still consumes per-pid resources.
  Explicitly detect and handle such cases: the manager keeps these hung
  tests forever, and we assign a new proc id for future processes (don't
  reuse the hung one).
* pkg/repro: be strict about titles during opt simplifications (Aleksandr Nogikh, 2024-08-27, 1 file, -38/+48)
  Ideally, we should be mindful of this during the whole repro process,
  but there's always a chance that different titles are manifestations of
  the same problem. So let's stay tolerant of different titles during prog
  extraction and minimization, but carefully check them during opt
  simplifications and C repro extraction.
* pkg/repro: don't exaggerate timeouts (Aleksandr Nogikh, 2024-08-27, 1 file, -28/+34)
  Our largest timeout is 6 minutes, so anything between 1.5 and 6 minutes
  ended up with a 9-minute timeout. That's too much. Consider the time it
  actually took to crash the kernel.
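  The idea reads roughly like the sketch below: derive the testing timeout
  from how long the crash actually took, clamped to sane bounds, instead of
  jumping straight to the largest timeout. The multiplier and bounds here
  are assumptions for illustration, not the real pkg/repro constants.

  ```go
  package main

  import (
  	"fmt"
  	"time"
  )

  // testTimeout picks a timeout proportional to the observed time-to-crash,
  // clamped between an assumed floor and the 6-minute ceiling mentioned above.
  func testTimeout(crashedAfter time.Duration) time.Duration {
  	const (
  		minTimeout = 30 * time.Second
  		maxTimeout = 6 * time.Minute
  	)
  	t := 3 * crashedAfter // leave headroom over the observed crash time
  	if t < minTimeout {
  		return minTimeout
  	}
  	if t > maxTimeout {
  		return maxTimeout
  	}
  	return t
  }

  func main() {
  	// A crash that took 100s gets a 5m timeout rather than the 9m it
  	// would have gotten under the old scheme.
  	fmt.Println(testTimeout(100 * time.Second))
  }
  ```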
* pkg/repro: increase the minimum testing timeouts (Aleksandr Nogikh, 2024-08-27, 1 file, -2/+2)
  15 seconds is an unreasonably small timeout. Let's do at least 30
  seconds first, then at least 100 seconds.
* pkg/repro: consider the executor info from the crash report (Aleksandr Nogikh, 2024-08-27, 1 file, -32/+63)
  1) If we know the tentative reproducer, try only it before the
     bisection; it's the best single candidate program.
  2) During bisection, never drop that program.
* prog: replace MinimizeParams with MinimizeMode (Dmitry Vyukov, 2024-08-07, 1 file, -17/+14)
  Callers shouldn't control lots of internal details of minimization (more
  params just means more variations to test, and in the end the params are
  just a more convoluted way to say whether we minimize for the corpus or
  for a crash). Two bools can express 4 options, but only 3 of them make
  sense, and when I see MinimizeParams{} in the code, it's unclear what it
  means. Replace the params with a mode. Also, "crash" minimization is not
  "light", it's just different: e.g. we can simplify int arguments for
  reproducers (esp. in snapshot mode), but we don't need that for the
  corpus.
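  The bools-to-mode refactoring pattern looks roughly like this. The
  constant names below are illustrative, not the actual prog package API;
  they only mirror the three meaningful options the message mentions.

  ```go
  package main

  import "fmt"

  // MinimizeMode replaces a struct of boolean knobs with one named mode,
  // so call sites state intent instead of a bag of flags.
  type MinimizeMode int

  const (
  	MinimizeCorpus    MinimizeMode = iota // minimize a program for the corpus
  	MinimizeCrash                         // minimize a crash reproducer
  	MinimizeCallsOnly                     // only remove calls, keep arguments
  )

  func (m MinimizeMode) String() string {
  	switch m {
  	case MinimizeCorpus:
  		return "corpus"
  	case MinimizeCrash:
  		return "crash"
  	case MinimizeCallsOnly:
  		return "calls-only"
  	}
  	return "unknown"
  }

  func main() {
  	// Unlike MinimizeParams{}, the mode is self-describing at the call site.
  	fmt.Println(MinimizeCrash)
  }
  ```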
* syz-manager: move fullReproLog() to pkg/repro (Aleksandr Nogikh, 2024-08-06, 1 file, -0/+10)
* syz-manager: still ignore log parse problems (Aleksandr Nogikh, 2024-07-17, 1 file, -1/+3)
  It seems that this error may come up in absolutely valid and reasonable
  cases. Restore the special casing.
* syz-manager: refactor empty crash log errors (Aleksandr Nogikh, 2024-07-17, 1 file, -3/+1)
  Now that we do not take the programs from the SSH-based logs, the error
  does look surprising, so let's print it with log.Errorf().
* pkg/repro: don't minimize to 0 calls (Aleksandr Nogikh, 2024-07-15, 1 file, -0/+5)
  Minimizing to 0 calls leads to an empty execution log, which causes an
  immediate exit of tools/syz-execprog, which in turn is recognized as
  "lost connection to machine".
* all: transition to instance.Pool (Aleksandr Nogikh, 2024-07-11, 3 files, -162/+118)
  Rely on instance.Pool to perform fuzzing and bug reproductions. Extract
  the reproduction queue logic into a separate testable class.
* pkg/repro: avoid hitting the prog.MaxCalls limit (Aleksandr Nogikh, 2024-05-27, 2 files, -15/+94)
  When we combine the progs found during prog bisection, there's a chance
  we may exceed the prog.MaxCalls limit. In that case, we get a SYZFATAL
  error and proceed as if it were the target crash, which is absolutely
  wrong. Let's first minimize each individual program before concatenating
  them; that should work in almost all cases.
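  A minimal sketch of the guard, with a toy program representation: each
  bisected program is minimized first, and the combined result is checked
  against the call limit. The types, the minimization stub, and the limit
  value are all illustrative; only prog.MaxCalls as a concept comes from
  the commit message.

  ```go
  package main

  import "fmt"

  const maxCalls = 40 // stand-in for prog.MaxCalls

  // prog models a program as a flat list of calls, for illustration only.
  type prog []string

  // minimize is a stub: pretend minimization roughly halves the program.
  func minimize(p prog) prog {
  	if len(p) <= 1 {
  		return p
  	}
  	return p[:len(p)/2+1]
  }

  // combine minimizes each program before concatenating, then enforces the
  // call limit instead of letting an oversized program hit a fatal error later.
  func combine(progs []prog) (prog, error) {
  	var out prog
  	for _, p := range progs {
  		out = append(out, minimize(p)...)
  	}
  	if len(out) > maxCalls {
  		return nil, fmt.Errorf("combined prog has %d calls, limit is %d", len(out), maxCalls)
  	}
  	return out, nil
  }

  func main() {
  	// Two 30-call programs would blow past the limit if concatenated
  	// as-is; minimizing each one first keeps the result within bounds.
  	got, err := combine([]prog{make(prog, 30), make(prog, 30)})
  	fmt.Println(len(got), err)
  }
  ```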
* prog: make minimization parameters explicit (Aleksandr Nogikh, 2024-05-27, 1 file, -1/+1)
  Add an explicit parameter to only run call removal.
* pkg/instance: use execprog to do basic instance testing (Dmitry Vyukov, 2024-05-27, 3 files, -5/+8)
  When we accept new kernels for fuzzing, we need more extensive testing,
  but syz-ci switched to using syz-manager for this purpose. Now instance
  testing is used only for bisection and patch testing, which do not need
  such extensive image testing (it may even be harmful). So just run a
  simple program as the test. It also uses the same features as the target
  reproducer, so e.g. if the reproducer does not use wifi, we won't test
  it, which reduces the chances of hitting unrelated kernel bugs.
* pkg/csource: remove the Repro option (Aleksandr Nogikh, 2024-05-17, 1 file, -1/+0)
  Enable it unconditionally.
* pkg/repro: don't clear the Repro flag (Aleksandr Nogikh, 2024-05-17, 1 file, -5/+0)
  If C reproducers keep printing "executing program" lines, it will be
  easier to re-use them during repro and patch testing.
* pkg/repro, pkg/ipc: use flatrpc.Feature (Dmitry Vyukov, 2024-05-06, 2 files, -33/+32)
  Start switching from host.Features to flatrpc.Features. This change is
  supposed to be a no-op, just to reduce future diffs that will change how
  we obtain features.
* vm: combine Run and MonitorExecution (Dmitry Vyukov, 2024-04-11, 1 file, -1/+1)
  All callers of Run always call MonitorExecution right after it.
  Combining these two methods allows hiding some implementation details
  and simplifies users of the vm package.
* pkg/repro: check reproducibility before bisecting (Aleksandr Nogikh, 2024-02-22, 1 file, -1/+11)
  In many cases bisection does not seem to bring any results, but it takes
  quite a while to run. Let's save some time by running the whole log
  before the process.
* pkg/repro: abort after 6 chunks (Aleksandr Nogikh, 2024-02-05, 2 files, -2/+2)
  8 feels like too high a number; let's stop reproduction earlier.
* pkg/repro: fix VM creation (Dmitry Vyukov, 2024-01-09, 1 file, -0/+1)
* pkg/repro: fix potential deadlock (Dmitry Vyukov, 2024-01-08, 1 file, -13/+5)
  If an instance fails to boot 3 times in a row, we drop it on the floor.
  If we drop all instances this way, the repro process deadlocks forever.
  Retry indefinitely instead; there is nothing better we can do in this
  case. If the instance becomes alive again, repro will resume.
* pkg/repro: make instance failure log less confusing (Aleksandr Nogikh, 2023-09-11, 1 file, -1/+2)
  Now it looks like a failure of the whole reproduction process. Adjust
  the message to reduce confusion.
* all: use special placeholder for errors (Taras Madan, 2023-07-24, 1 file, -2/+2)
* pkg/repro: tolerate two consecutive run errors (Aleksandr Nogikh, 2023-07-20, 2 files, -4/+10)
  Retrying once has greatly reduced the number of "failed to copy prog to
  VM" errors, but they still periodically pop up. The underlying problem
  is still not 100% known. Supposedly, if a booted VM with an instrumented
  kernel has to wait too long, it can just hang or crash by itself, at
  least on some problematic revisions. Investigating would be quite
  time-consuming -- we'd need a complicated refactoring in order to also
  capture serial output for Copy() failures, and so far that does not seem
  worth it. Let's do 3 runOnInstance() attempts. If the problem still
  persists after that, there's no point in doing more runs -- we'd have to
  determine the exact root cause.
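  The retry policy above amounts to a small bounded-retry loop; a hedged
  sketch, with names and error text chosen for illustration (the real
  runOnInstance() signature differs):

  ```go
  package main

  import (
  	"errors"
  	"fmt"
  )

  // maxAttempts mirrors the "3 runOnInstance() attempts" policy: past that,
  // more retries would not reveal the root cause anyway.
  const maxAttempts = 3

  // runWithRetry calls run up to maxAttempts times and returns the last
  // error if every attempt failed.
  func runWithRetry(run func() error) error {
  	var err error
  	for i := 0; i < maxAttempts; i++ {
  		if err = run(); err == nil {
  			return nil
  		}
  	}
  	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
  }

  func main() {
  	// A transient error that clears on the third try succeeds overall.
  	fails := 0
  	err := runWithRetry(func() error {
  		fails++
  		if fails < 3 {
  			return errors.New("failed to copy prog to VM")
  		}
  		return nil
  	})
  	fmt.Println(err, fails)
  }
  ```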
* pkg/report: request VMs outside of createInstances() (Aleksandr Nogikh, 2023-07-12, 1 file, -26/+28)
  In the current code, there's a possibility that we write to
  ctx.bootRequests after it was quickly closed. That could happen when we
  immediately abort the reproduction process after it has started. To
  avoid this, don't send elements over the bootRequests channel in the
  createInstances() function. Hopefully closes #4016.
* pkg/repro: use the generic minimization algorithm (Aleksandr Nogikh, 2023-07-07, 2 files, -109/+25)
* pkg/report: move report.Type to pkg/report/crash (Aleksandr Nogikh, 2023-07-05, 1 file, -7/+9)
  This will help avoid a circular dependency pkg/vcs -> pkg/report ->
  pkg/vcs.
* pkg/report: extract more report types for Linux (Aleksandr Nogikh, 2023-07-05, 1 file, -1/+1)
  Amend oops and oopsFormat to contain the report type.
* all: support swap feature on Linux (Aleksandr Nogikh, 2023-06-15, 1 file, -0/+11)
  If the feature is supported on the device, allocate a 128MB swap file
  after VM boot and activate it.
* syz-manager: don't report all pkg/repro errors on shutdown (Aleksandr Nogikh, 2023-06-09, 1 file, -1/+3)
  Otherwise we get "repro failed: all VMs failed to boot" pkg/repro errors
  when a syzkaller instance is shutting down.
* pkg/repro: retry on ExecProgInstance errors (Aleksandr Nogikh, 2023-05-25, 2 files, -15/+77)
  Most of those errors seem to be transient, so there's no sense in
  failing the whole C repro generation process. Give it one more chance
  and only fail after that.
* pkg/report: restructure the shutdown process (Aleksandr Nogikh, 2023-05-25, 2 files, -8/+14)
  Only use ctx.bootRequests to indicate that no further VMs are needed. Do
  not return from Run() until we have fully stopped the VM creation loop,
  as there's a risk it might interfere with fuzzing.
* pkg/repro: add a smoke test (Aleksandr Nogikh, 2023-05-25, 2 files, -6/+117)
  This is a sanity test for the overall pkg/repro machinery. It does not
  focus on minor corner cases.
* pkg/repro: factor out an interface (Aleksandr Nogikh, 2023-05-25, 1 file, -8/+15)
  Interact with a syz-execprog instance via an additional interface. This
  will simplify testing.
* pkg/repro: refactor the Run() method (Aleksandr Nogikh, 2023-05-25, 1 file, -45/+64)
  Split Run() into several functions to facilitate testing. This commit
  does not introduce any functional changes.
* pkg/report: don't record an error for the empty repro log case (Aleksandr Nogikh, 2023-05-25, 1 file, -1/+4)
  It's not entirely normal, but it can still happen, and it's not a big
  problem by itself. Let's not pollute our error logs.
* pkg/testutil: add RandSource helper (Dmitry Vyukov, 2022-11-23, 1 file, -9/+2)
  The code to set up a rand source is duplicated in several packages. Move
  it to the testutil package.
* executor: add NIC PCI pass-through VF support (George Kennedy, 2022-09-21, 1 file, -0/+11)
  Add support for moving a NIC PCI pass-through VF into syzkaller's
  network namespace so that it will be tested. As DEVLINK support is
  triggered by setting the pass-through device to "addr=0x10", NIC PCI
  pass-through VF support will be triggered by setting the device to
  "addr=0x11". If a NIC PCI pass-through VF is detected in do_sandbox, set
  up a staging namespace before the fork() and transfer the NIC VF
  interface to it. After the fork(), in the child, transfer the NIC VF
  interface to syzkaller's network namespace and rename the interface to
  netpci0 so that it will be tested.
  Signed-off-by: George Kennedy <george.kennedy@oracle.com>