Files
runc/vendor/golang.org/x/sys/unix/syscall_dragonfly.go
Aleksa Sarai 7a8d7162f9 seccomp: prepend -ENOSYS stub to all filters
Having -EPERM is the default was a fairly significant mistake from a
future-proofing standpoint in that it makes any new syscall return a
non-ignorable error (from glibc's point of view). We need to correct
this now because faccessat2(2) is something glibc critically needs to
have support for, but they're blocked on container runtimes because we
return -EPERM unconditionally (leading to confusion in glibc). This is
also a problem we're probably going to keep running into in the future.

Unfortunately there are several issues which stop us from having a clean
solution to this problem:

 1. libseccomp has several limitations which require us to emulate
    behaviour we want:

    a. We cannot do logic based on syscall number, meaning we cannot
       specify a "largest known syscall number";
    b. libseccomp doesn't know in which kernel version a syscall was
       added, and has no API for "minimum kernel version" so we cannot
       simply ask libseccomp to generate sane -ENOSYS rules for us.
    c. Additional seccomp rules for the same syscall are not treated as
       distinct rules -- if rules overlap, seccomp will merge them. This
       means we cannot add per-syscall -EPERM fallbacks;
    d. There is no inverse operation for SCMP_CMP_MASKED_EQ;
    e. libseccomp does not allow you to specify multiple rules for a
       single argument, making it impossible to invert OR rules for
       arguments.

 2. The runtime-spec does not have any way of specifying:

    a. The errno for the default action;
    b. The minimum kernel version or "newest syscall at time of profile
       creation"; nor
    c. Which syscalls were intentionally excluded from the allow list
       (weird syscalls that are no longer used were excluded entirely,
       but Docker et al expect those syscalls to get EPERM not ENOSYS).

 3. Certain syscalls should not return -ENOSYS (especially only for
    certain argument combinations) because this could also trigger glibc
    confusion. This means we have to return -EPERM for certain syscalls
    but not as a global default.

 4. There is not an obvious (and reasonable) upper limit to syscall
    numbers, so we cannot create a set of rules for each syscall above
    the largest syscall number in libseccomp. This means we must handle
    inverse rules as described below.

 5. Any syscall can be specified multiple times, which can make
    generation of hotfix rules much harder.

As a result, we have to work around all of these things by coming up
with a heuristic to stop the bleeding. In the future we could hopefully
improve the situation in the runtime-spec and libseccomp.

The solution applied here is to prepend a "stub" filter which returns
-ENOSYS if the requested syscall has a larger syscall number than any
syscall mentioned in the filter. The reason for this specific rule is
that syscall numbers are (roughly) allocated sequentially and thus newer
syscalls will (usually) have a larger syscall number -- thus causing our
filters to produce -ENOSYS if the filter was written before the syscall
existed.

Sadly this is not a perfect solution because syscalls can be added
out-of-order and the syscall table can contain holes for several
releases. Unfortuntely we do not have a nicer solution at the moment
because there is no library which provides information about which Linux
version a syscall was introduced in. Until that exists, this workaround
will have to be good enough.

The above behaviour only happens if the default action is a blocking
action (in other words it is not SCMP_ACT_LOG or SCMP_ACT_ALLOW). If the
default action is permissive then we don't do any patching.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2021-01-28 23:11:22 +11:00

542 lines
14 KiB
Go

// Copyright 2009 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
// DragonFly BSD system calls.
// This file is compiled as ordinary Go code,
// but it is also input to mksyscall,
// which parses the //sys lines and generates system call stubs.
// Note that sometimes we use a lowercase //sys name and wrap
// it in our own nicer implementation, either here or in
// syscall_bsd.go or syscall_unix.go.
package unix
import (
"sync"
"unsafe"
)
// See version list in https://github.com/DragonFlyBSD/DragonFlyBSD/blob/master/sys/sys/param.h
var (
osreldateOnce sync.Once
osreldate uint32
)
// First __DragonFly_version after September 2019 ABI changes
// http://lists.dragonflybsd.org/pipermail/users/2019-September/358280.html
const _dragonflyABIChangeVersion = 500705
func supportsABI(ver uint32) bool {
osreldateOnce.Do(func() { osreldate, _ = SysctlUint32("kern.osreldate") })
return osreldate >= ver
}
// SockaddrDatalink implements the Sockaddr interface for AF_LINK type sockets.
type SockaddrDatalink struct {
Len uint8
Family uint8
Index uint16
Type uint8
Nlen uint8
Alen uint8
Slen uint8
Data [12]int8
Rcf uint16
Route [16]uint16
raw RawSockaddrDatalink
}
func anyToSockaddrGOOS(fd int, rsa *RawSockaddrAny) (Sockaddr, error) {
return nil, EAFNOSUPPORT
}
// Translate "kern.hostname" to []_C_int{0,1,2,3}.
func nametomib(name string) (mib []_C_int, err error) {
const siz = unsafe.Sizeof(mib[0])
// NOTE(rsc): It seems strange to set the buffer to have
// size CTL_MAXNAME+2 but use only CTL_MAXNAME
// as the size. I don't know why the +2 is here, but the
// kernel uses +2 for its own implementation of this function.
// I am scared that if we don't include the +2 here, the kernel
// will silently write 2 words farther than we specify
// and we'll get memory corruption.
var buf [CTL_MAXNAME + 2]_C_int
n := uintptr(CTL_MAXNAME) * siz
p := (*byte)(unsafe.Pointer(&buf[0]))
bytes, err := ByteSliceFromString(name)
if err != nil {
return nil, err
}
// Magic sysctl: "setting" 0.3 to a string name
// lets you read back the array of integers form.
if err = sysctl([]_C_int{0, 3}, p, &n, &bytes[0], uintptr(len(name))); err != nil {
return nil, err
}
return buf[0 : n/siz], nil
}
func direntIno(buf []byte) (uint64, bool) {
return readInt(buf, unsafe.Offsetof(Dirent{}.Fileno), unsafe.Sizeof(Dirent{}.Fileno))
}
func direntReclen(buf []byte) (uint64, bool) {
namlen, ok := direntNamlen(buf)
if !ok {
return 0, false
}
return (16 + namlen + 1 + 7) &^ 7, true
}
func direntNamlen(buf []byte) (uint64, bool) {
return readInt(buf, unsafe.Offsetof(Dirent{}.Namlen), unsafe.Sizeof(Dirent{}.Namlen))
}
//sysnb pipe() (r int, w int, err error)
func Pipe(p []int) (err error) {
if len(p) != 2 {
return EINVAL
}
p[0], p[1], err = pipe()
return
}
//sysnb pipe2(p *[2]_C_int, flags int) (err error)
func Pipe2(p []int, flags int) error {
if len(p) != 2 {
return EINVAL
}
var pp [2]_C_int
err := pipe2(&pp, flags)
p[0] = int(pp[0])
p[1] = int(pp[1])
return err
}
//sys extpread(fd int, p []byte, flags int, offset int64) (n int, err error)
func Pread(fd int, p []byte, offset int64) (n int, err error) {
return extpread(fd, p, 0, offset)
}
//sys extpwrite(fd int, p []byte, flags int, offset int64) (n int, err error)
func Pwrite(fd int, p []byte, offset int64) (n int, err error) {
return extpwrite(fd, p, 0, offset)
}
func Accept4(fd, flags int) (nfd int, sa Sockaddr, err error) {
var rsa RawSockaddrAny
var len _Socklen = SizeofSockaddrAny
nfd, err = accept4(fd, &rsa, &len, flags)
if err != nil {
return
}
if len > SizeofSockaddrAny {
panic("RawSockaddrAny too small")
}
sa, err = anyToSockaddr(fd, &rsa)
if err != nil {
Close(nfd)
nfd = 0
}
return
}
//sys Getcwd(buf []byte) (n int, err error) = SYS___GETCWD
func Getfsstat(buf []Statfs_t, flags int) (n int, err error) {
var _p0 unsafe.Pointer
var bufsize uintptr
if len(buf) > 0 {
_p0 = unsafe.Pointer(&buf[0])
bufsize = unsafe.Sizeof(Statfs_t{}) * uintptr(len(buf))
}
r0, _, e1 := Syscall(SYS_GETFSSTAT, uintptr(_p0), bufsize, uintptr(flags))
n = int(r0)
if e1 != 0 {
err = e1
}
return
}
func setattrlistTimes(path string, times []Timespec, flags int) error {
// used on Darwin for UtimesNano
return ENOSYS
}
//sys ioctl(fd int, req uint, arg uintptr) (err error)
//sys sysctl(mib []_C_int, old *byte, oldlen *uintptr, new *byte, newlen uintptr) (err error) = SYS___SYSCTL
func sysctlUname(mib []_C_int, old *byte, oldlen *uintptr) error {
err := sysctl(mib, old, oldlen, nil, 0)
if err != nil {
// Utsname members on Dragonfly are only 32 bytes and
// the syscall returns ENOMEM in case the actual value
// is longer.
if err == ENOMEM {
err = nil
}
}
return err
}
func Uname(uname *Utsname) error {
mib := []_C_int{CTL_KERN, KERN_OSTYPE}
n := unsafe.Sizeof(uname.Sysname)
if err := sysctlUname(mib, &uname.Sysname[0], &n); err != nil {
return err
}
uname.Sysname[unsafe.Sizeof(uname.Sysname)-1] = 0
mib = []_C_int{CTL_KERN, KERN_HOSTNAME}
n = unsafe.Sizeof(uname.Nodename)
if err := sysctlUname(mib, &uname.Nodename[0], &n); err != nil {
return err
}
uname.Nodename[unsafe.Sizeof(uname.Nodename)-1] = 0
mib = []_C_int{CTL_KERN, KERN_OSRELEASE}
n = unsafe.Sizeof(uname.Release)
if err := sysctlUname(mib, &uname.Release[0], &n); err != nil {
return err
}
uname.Release[unsafe.Sizeof(uname.Release)-1] = 0
mib = []_C_int{CTL_KERN, KERN_VERSION}
n = unsafe.Sizeof(uname.Version)
if err := sysctlUname(mib, &uname.Version[0], &n); err != nil {
return err
}
// The version might have newlines or tabs in it, convert them to
// spaces.
for i, b := range uname.Version {
if b == '\n' || b == '\t' {
if i == len(uname.Version)-1 {
uname.Version[i] = 0
} else {
uname.Version[i] = ' '
}
}
}
mib = []_C_int{CTL_HW, HW_MACHINE}
n = unsafe.Sizeof(uname.Machine)
if err := sysctlUname(mib, &uname.Machine[0], &n); err != nil {
return err
}
uname.Machine[unsafe.Sizeof(uname.Machine)-1] = 0
return nil
}
func Sendfile(outfd int, infd int, offset *int64, count int) (written int, err error) {
if raceenabled {
raceReleaseMerge(unsafe.Pointer(&ioSync))
}
return sendfile(outfd, infd, offset, count)
}
/*
* Exposed directly
*/
//sys Access(path string, mode uint32) (err error)
//sys Adjtime(delta *Timeval, olddelta *Timeval) (err error)
//sys Chdir(path string) (err error)
//sys Chflags(path string, flags int) (err error)
//sys Chmod(path string, mode uint32) (err error)
//sys Chown(path string, uid int, gid int) (err error)
//sys Chroot(path string) (err error)
//sys Close(fd int) (err error)
//sys Dup(fd int) (nfd int, err error)
//sys Dup2(from int, to int) (err error)
//sys Exit(code int)
//sys Faccessat(dirfd int, path string, mode uint32, flags int) (err error)
//sys Fchdir(fd int) (err error)
//sys Fchflags(fd int, flags int) (err error)
//sys Fchmod(fd int, mode uint32) (err error)
//sys Fchmodat(dirfd int, path string, mode uint32, flags int) (err error)
//sys Fchown(fd int, uid int, gid int) (err error)
//sys Fchownat(dirfd int, path string, uid int, gid int, flags int) (err error)
//sys Flock(fd int, how int) (err error)
//sys Fpathconf(fd int, name int) (val int, err error)
//sys Fstat(fd int, stat *Stat_t) (err error)
//sys Fstatat(fd int, path string, stat *Stat_t, flags int) (err error)
//sys Fstatfs(fd int, stat *Statfs_t) (err error)
//sys Fsync(fd int) (err error)
//sys Ftruncate(fd int, length int64) (err error)
//sys Getdents(fd int, buf []byte) (n int, err error)
//sys Getdirentries(fd int, buf []byte, basep *uintptr) (n int, err error)
//sys Getdtablesize() (size int)
//sysnb Getegid() (egid int)
//sysnb Geteuid() (uid int)
//sysnb Getgid() (gid int)
//sysnb Getpgid(pid int) (pgid int, err error)
//sysnb Getpgrp() (pgrp int)
//sysnb Getpid() (pid int)
//sysnb Getppid() (ppid int)
//sys Getpriority(which int, who int) (prio int, err error)
//sysnb Getrlimit(which int, lim *Rlimit) (err error)
//sysnb Getrusage(who int, rusage *Rusage) (err error)
//sysnb Getsid(pid int) (sid int, err error)
//sysnb Gettimeofday(tv *Timeval) (err error)
//sysnb Getuid() (uid int)
//sys Issetugid() (tainted bool)
//sys Kill(pid int, signum syscall.Signal) (err error)
//sys Kqueue() (fd int, err error)
//sys Lchown(path string, uid int, gid int) (err error)
//sys Link(path string, link string) (err error)
//sys Linkat(pathfd int, path string, linkfd int, link string, flags int) (err error)
//sys Listen(s int, backlog int) (err error)
//sys Lstat(path string, stat *Stat_t) (err error)
//sys Mkdir(path string, mode uint32) (err error)
//sys Mkdirat(dirfd int, path string, mode uint32) (err error)
//sys Mkfifo(path string, mode uint32) (err error)
//sys Mknod(path string, mode uint32, dev int) (err error)
//sys Mknodat(fd int, path string, mode uint32, dev int) (err error)
//sys Nanosleep(time *Timespec, leftover *Timespec) (err error)
//sys Open(path string, mode int, perm uint32) (fd int, err error)
//sys Openat(dirfd int, path string, mode int, perm uint32) (fd int, err error)
//sys Pathconf(path string, name int) (val int, err error)
//sys read(fd int, p []byte) (n int, err error)
//sys Readlink(path string, buf []byte) (n int, err error)
//sys Rename(from string, to string) (err error)
//sys Renameat(fromfd int, from string, tofd int, to string) (err error)
//sys Revoke(path string) (err error)
//sys Rmdir(path string) (err error)
//sys Seek(fd int, offset int64, whence int) (newoffset int64, err error) = SYS_LSEEK
//sys Select(nfd int, r *FdSet, w *FdSet, e *FdSet, timeout *Timeval) (n int, err error)
//sysnb Setegid(egid int) (err error)
//sysnb Seteuid(euid int) (err error)
//sysnb Setgid(gid int) (err error)
//sys Setlogin(name string) (err error)
//sysnb Setpgid(pid int, pgid int) (err error)
//sys Setpriority(which int, who int, prio int) (err error)
//sysnb Setregid(rgid int, egid int) (err error)
//sysnb Setreuid(ruid int, euid int) (err error)
//sysnb Setresgid(rgid int, egid int, sgid int) (err error)
//sysnb Setresuid(ruid int, euid int, suid int) (err error)
//sysnb Setrlimit(which int, lim *Rlimit) (err error)
//sysnb Setsid() (pid int, err error)
//sysnb Settimeofday(tp *Timeval) (err error)
//sysnb Setuid(uid int) (err error)
//sys Stat(path string, stat *Stat_t) (err error)
//sys Statfs(path string, stat *Statfs_t) (err error)
//sys Symlink(path string, link string) (err error)
//sys Symlinkat(oldpath string, newdirfd int, newpath string) (err error)
//sys Sync() (err error)
//sys Truncate(path string, length int64) (err error)
//sys Umask(newmask int) (oldmask int)
//sys Undelete(path string) (err error)
//sys Unlink(path string) (err error)
//sys Unlinkat(dirfd int, path string, flags int) (err error)
//sys Unmount(path string, flags int) (err error)
//sys write(fd int, p []byte) (n int, err error)
//sys mmap(addr uintptr, length uintptr, prot int, flag int, fd int, pos int64) (ret uintptr, err error)
//sys munmap(addr uintptr, length uintptr) (err error)
//sys readlen(fd int, buf *byte, nbuf int) (n int, err error) = SYS_READ
//sys writelen(fd int, buf *byte, nbuf int) (n int, err error) = SYS_WRITE
//sys accept4(fd int, rsa *RawSockaddrAny, addrlen *_Socklen, flags int) (nfd int, err error)
//sys utimensat(dirfd int, path string, times *[2]Timespec, flags int) (err error)
/*
* Unimplemented
* TODO(jsing): Update this list for DragonFly.
*/
// Profil
// Sigaction
// Sigprocmask
// Getlogin
// Sigpending
// Sigaltstack
// Reboot
// Execve
// Vfork
// Sbrk
// Sstk
// Ovadvise
// Mincore
// Setitimer
// Swapon
// Select
// Sigsuspend
// Readv
// Writev
// Nfssvc
// Getfh
// Quotactl
// Mount
// Csops
// Waitid
// Add_profil
// Kdebug_trace
// Sigreturn
// Atsocket
// Kqueue_from_portset_np
// Kqueue_portset
// Getattrlist
// Setattrlist
// Getdirentriesattr
// Searchfs
// Delete
// Copyfile
// Watchevent
// Waitevent
// Modwatch
// Getxattr
// Fgetxattr
// Setxattr
// Fsetxattr
// Removexattr
// Fremovexattr
// Listxattr
// Flistxattr
// Fsctl
// Initgroups
// Posix_spawn
// Nfsclnt
// Fhopen
// Minherit
// Semsys
// Msgsys
// Shmsys
// Semctl
// Semget
// Semop
// Msgctl
// Msgget
// Msgsnd
// Msgrcv
// Shmat
// Shmctl
// Shmdt
// Shmget
// Shm_open
// Shm_unlink
// Sem_open
// Sem_close
// Sem_unlink
// Sem_wait
// Sem_trywait
// Sem_post
// Sem_getvalue
// Sem_init
// Sem_destroy
// Open_extended
// Umask_extended
// Stat_extended
// Lstat_extended
// Fstat_extended
// Chmod_extended
// Fchmod_extended
// Access_extended
// Settid
// Gettid
// Setsgroups
// Getsgroups
// Setwgroups
// Getwgroups
// Mkfifo_extended
// Mkdir_extended
// Identitysvc
// Shared_region_check_np
// Shared_region_map_np
// __pthread_mutex_destroy
// __pthread_mutex_init
// __pthread_mutex_lock
// __pthread_mutex_trylock
// __pthread_mutex_unlock
// __pthread_cond_init
// __pthread_cond_destroy
// __pthread_cond_broadcast
// __pthread_cond_signal
// Setsid_with_pid
// __pthread_cond_timedwait
// Aio_fsync
// Aio_return
// Aio_suspend
// Aio_cancel
// Aio_error
// Aio_read
// Aio_write
// Lio_listio
// __pthread_cond_wait
// Iopolicysys
// __pthread_kill
// __pthread_sigmask
// __sigwait
// __disable_threadsignal
// __pthread_markcancel
// __pthread_canceled
// __semwait_signal
// Proc_info
// Stat64_extended
// Lstat64_extended
// Fstat64_extended
// __pthread_chdir
// __pthread_fchdir
// Audit
// Auditon
// Getauid
// Setauid
// Getaudit
// Setaudit
// Getaudit_addr
// Setaudit_addr
// Auditctl
// Bsdthread_create
// Bsdthread_terminate
// Stack_snapshot
// Bsdthread_register
// Workq_open
// Workq_ops
// __mac_execve
// __mac_syscall
// __mac_get_file
// __mac_set_file
// __mac_get_link
// __mac_set_link
// __mac_get_proc
// __mac_set_proc
// __mac_get_fd
// __mac_set_fd
// __mac_get_pid
// __mac_get_lcid
// __mac_get_lctx
// __mac_set_lctx
// Setlcid
// Read_nocancel
// Write_nocancel
// Open_nocancel
// Close_nocancel
// Wait4_nocancel
// Recvmsg_nocancel
// Sendmsg_nocancel
// Recvfrom_nocancel
// Accept_nocancel
// Fcntl_nocancel
// Select_nocancel
// Fsync_nocancel
// Connect_nocancel
// Sigsuspend_nocancel
// Readv_nocancel
// Writev_nocancel
// Sendto_nocancel
// Pread_nocancel
// Pwrite_nocancel
// Waitid_nocancel
// Msgsnd_nocancel
// Msgrcv_nocancel
// Sem_wait_nocancel
// Aio_suspend_nocancel
// __sigwait_nocancel
// __semwait_signal_nocancel
// __mac_mount
// __mac_get_mount
// __mac_getfsstat