/r/kernel

A moderated community dedicated to technical discussion about the Linux kernel.

Welcome to /r/kernel, a moderated community dedicated to all things about the Linux kernel. Technical articles only, please!

17,086 Subscribers

3

What does it mean to mmap() a virtual file?

I have read about how mmap() is better when dealing with large files and how memory does not need to be swapped out, etc.

But in KVM, the kvm_run structure is mapped by calling mmap() on the vCPU's file descriptor. The vCPU is not really a file on disk, but a virtual file with its own file operations (fops).

Why is mmap() used and what does it mean in the context of virtual files? (coming from QEMU and kvmtool source code)
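One way to see what mmap() means for a file that has no disk backing is to try it on a simpler virtual file. The sketch below maps /dev/zero, a character device whose driver supplies zero-filled memory; the kernel hands the mmap() call to the file's own mmap handler, which decides what memory backs the mapping. The KVM vCPU descriptor works on the same principle, with its handler exposing the kvm_run structure shared with the kernel (its size is reported by the KVM_GET_VCPU_MMAP_SIZE ioctl). The helper names map_virtual_file and all_zero are made up for this illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `len` bytes of a device file that has no on-disk contents.
 * mmap() is routed to the device's own mmap handler, which chooses the
 * backing memory. Returns MAP_FAILED on error.
 */
void *map_virtual_file(const char *path, size_t len)
{
    assert(path != NULL && len > 0);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return MAP_FAILED;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

    /* The mapping keeps its own reference to the file, so the fd can go. */
    close(fd);
    return p;
}

/* Returns 1 if every byte of the mapping reads as zero, 0 otherwise. */
int all_zero(const unsigned char *p, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (p[i] != 0)
            return 0;
    return 1;
}
```

Calling map_virtual_file("/dev/zero", page_size) and then reading or writing through the returned pointer touches memory directly, with no read() or write() system call per access. That is the main attraction for kvm_run, which userspace inspects after every KVM_RUN exit.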

2 Comments
2024/06/17
05:21 UTC

8

DMA Engine - How to handle DMA "timeouts"

Hey all,

I'm new to the DMA and DMA Engine APIs in the kernel. I have a hardware device (FPGA custom logic) that works with a DMA. The vendor (Xilinx) supplies a DMA Engine driver and some tests that are very well maintained and received by users. The nature of my custom logic is sort-of like a NIC; data is pushed and pulled via this DMA channel.

Xilinx provides a reference driver on top of their core DMA driver that does some userspace memory mapping and provides a chardev interface, making it easy for newbies to do what people most commonly want: push and pull data between userspace and the kernel. I bring this up because ALL the DMA drivers I found (including these prototypes from Xilinx) and various "DMA test" drivers seem not to handle "timeouts" well. I do not plan to use this dma-proxy driver, but it exists online and is easy to reference.

To reference the example from Xilinx (dma_proxy.c): when we want to receive data over my DMA channel, it does:

start_transfer() {
    sg_init_table(..., 1);
    sg_dma_address(... ) = foo.dma_handle;
    sg_dma_len(...) = foo.length;
    chan_desc = dma_device->device_prep_slave_sg(..., ..., 1, ..., ..., NULL);
    ...
}

Then waits on the completion:

wait_for_transfer() {
    unsigned long timeout = msecs_to_jiffies(3000);
    timeout = wait_for_completion_timeout(foo.cmp, timeout);
    status = dma_async_is_tx_complete(..., ..., NULL, NULL);

    if (timeout == 0)  {
        printk(KERN_ERR "DMA timed out\n");
    }
    else { ... }
    ...
}

For my specific peripheral/"hardware", when "pulling" from the DMA, data may not be ready (and we may not receive an interrupt).

What I don't understand is how to handle the timeout correctly. Maybe I need to switch the Rx/receive path to polling? None of the examples seem to expect these DMA slave requests to fail. The result of the timeout is that some descriptor (I think chan_desc above) is never released, so after 3 s * 255 (the size of some descriptor list), my DMA device/handle can no longer submit slave requests.
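The exhaustion described above can be reproduced with a toy userspace model. This is only a sketch: struct toy_chan and the toy_* functions are invented for illustration and are not the dmaengine API. It assumes that a timed-out transfer keeps its descriptor until the channel is explicitly terminated. In a real driver the corresponding call would be something like dmaengine_terminate_sync() on the channel after the wait times out, but whether the Xilinx driver reclaims its descriptors that way is an assumption to verify against its source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_SIZE 255  /* descriptor count mentioned in the post */

/* Toy stand-in for a DMA channel's descriptor pool; not a kernel type. */
struct toy_chan {
    size_t in_flight;  /* descriptors currently owned by the "hardware" */
};

/* Queue one transfer; fails once every descriptor is in flight. */
bool toy_submit(struct toy_chan *c)
{
    assert(c != NULL);
    if (c->in_flight == POOL_SIZE)
        return false;
    c->in_flight++;
    return true;
}

/* Models a terminate call: every pending descriptor is reclaimed. */
void toy_terminate_all(struct toy_chan *c)
{
    assert(c != NULL);
    c->in_flight = 0;
}

/*
 * Run `attempts` receives that all time out. If reclaim_on_timeout is
 * set, the channel is terminated after each timeout. Returns how many
 * submissions succeeded before the pool ran dry.
 */
size_t toy_run_timeouts(struct toy_chan *c, size_t attempts,
                        bool reclaim_on_timeout)
{
    size_t ok = 0;

    for (size_t i = 0; i < attempts; i++) {
        if (!toy_submit(c))
            break;  /* pool exhausted: the channel is stuck */
        ok++;
        /* The wait for completion would time out at this point. */
        if (reclaim_on_timeout)
            toy_terminate_all(c);
    }
    return ok;
}
```

Without the reclaim step the model stops after exactly 255 submissions, matching the "3 s * 255" observation. With it, the channel keeps accepting requests indefinitely.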

Any advice?

I posted this same question to the kernelnewbies mailing list as well.

Thanks!

0 Comments
2024/06/11
05:25 UTC

4

Why does it take 7 seconds to reserve huge pages at boot?

I have a server with 300G of RAM and I'm preallocating ~192G of it as huge pages by passing hugepagesz=1G hugepages=192 args to kernel command line.

As a result, I see these lines in dmesg:

[    1.276475] mem auto-init: stack:off, heap alloc:off, heap free:off
[    8.102710] Memory: 2973348K/301621240K available (14349K kernel code, 9611K rwdata, 8492K rodata, 2516K init, 20268K bss, 206226124K reserved, 0K cma-reserved)

Without this preallocation there is no such gap in the timestamps. Does anybody have a clue why it takes the kernel so long to reserve huge pages?
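The numbers in the dmesg excerpt can be checked with a little arithmetic. The sketch below does not explain where the time goes; it only works out how much of the "reserved" figure the 192 huge pages account for and the average cost per page across the timestamp gap. The function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define KIB_PER_GIB (1024ULL * 1024ULL)

/* KiB consumed by `count` huge pages of 1 GiB each. */
uint64_t hugepages_kib(uint64_t count)
{
    return count * KIB_PER_GIB;
}

/* Reserved memory in KiB that the huge page pool does not account for. */
uint64_t other_reserved_kib(uint64_t reserved_kib, uint64_t hugepage_count)
{
    uint64_t pool = hugepages_kib(hugepage_count);

    assert(reserved_kib >= pool);
    return reserved_kib - pool;
}

/* Average milliseconds per huge page across the dmesg timestamp gap. */
double ms_per_hugepage(double t_before_s, double t_after_s, uint64_t count)
{
    assert(count > 0);
    return (t_after_s - t_before_s) * 1000.0 / (double)count;
}
```

With the values from the log, hugepages_kib(192) is 201326592 KiB, which leaves 4899532 KiB of the 206226124K reserved figure for everything else, and ms_per_hugepage(1.276475, 8.102710, 192) comes out at roughly 36 ms per 1 GiB page.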

4 Comments
2024/06/09
19:56 UTC

1

block device driver: reading does not work

Kernel: 5.15.0-70-generic.

Previously (parameter is_remap=0), for a similar task, upon receiving an incoming bio I built my own bio to the higher-level device, and everything worked, but slowly: the write speed to the flash drive was ~460 KB/s. Then I decided to forward the incoming bio to the upstream device directly (is_remap=1). If I do not modify the data, everything works and the speed increases to 1.8 MB/s, i.e. ~4 times. But if I start modifying the data (and this is necessary), only writing works. When reading, dd receives undecrypted data, and bio in stackbd_end_io_read_cloned (previously cloned using bio_clone_fast in stackbd_io_fn_remap) has a size of zero, while the size of obio is non-zero. How does this even happen, and how do I do it right?

Interestingly, if in stackbd_end_io_read_cloned I modify the data after the bio_endio call, dd does receive the decrypted data, but I feel that doing this is not correct. This is confirmed by the fact that running fsck after mkfs crashes the system.

For example, I read the sector:

user@linux:~/git/stackbd/module$ sudo dd if=/dev/stackbd0 count=1 | hexdump -C
00000000  63 d0 18 e5 e3 ee fb a6  ee e9 fc 88 8a a8 a8 88  |c...............|
00000010  8a 88 88 88 88 70 88 88  98 88 8c 88 88 88 88 88  |.....p..........|
00000020  88 48 26 8b 88 b3 88 88  88 88 88 88 8a 88 88 88  |.H&.............|
00000030  89 88 8e 88 88 88 88 88  88 88 88 88 88 88 88 88  |................|
00000040  08 88 a1 57 55 08 9b c6  c7 a8 c6 c9 c5 cd a8 a8  |...WU...........|
00000050  a8 a8 ce c9 dc bb ba a8  a8 a8 86 97 36 ff f4 24  |............6..$|
00000060  aa 48 fc 83 de 3c 86 33  8f 88 45 98 d6 63 78 ba  |.H...<.3..E..cx.|
00000070  6c 45 9e 45 91 63 76 dc  e0 e1 fb a8 e1 fb a8 e6  |lE.E.cv.........|
00000080  e7 fc a8 e9 a8 ea e7 e7  fc e9 ea e4 ed a8 ec e1  |................|
00000090  fb e3 a6 a8 a8 d8 e4 ed  e9 fb ed a8 e1 e6 fb ed  |................|
000000a0  fa fc a8 e9 a8 ea e7 e7  fc e9 ea e4 ed a8 ee e4  |................|
000000b0  e7 f8 f8 f1 a8 e9 e6 ec  85 82 f8 fa ed fb fb a8  |................|
000000c0  e9 e6 f1 a8 e3 ed f1 a8  fc e7 a8 fc fa f1 a8 e9  |................|
000000d0  ef e9 e1 e6 a8 a6 a6 a6  a8 85 82 88 88 88 88 88  |................|
000000e0  88 88 88 88 88 88 88 88  88 88 88 88 88 88 88 88  |................|
*
000001f0  88 88 88 88 88 88 88 88  88 88 88 88 88 88 dd 22  |..............."|
1+0 records in
1+0 records out
512 bytes copied, 0,0063565 s, 80,5 kB/s
00000200
user@linux:~/git/stackbd/module$

And this is what I see in the log:

kernel: stackbd [task=00000000c60564d5] stackbd_io_fn_remap: HIT.r.1
kernel: debugbd [task=00000000c60564d5] debugbd_submit_bio: debugbd: make request read  block 0            #pages 0    total-size 16384     
kernel: stackbd [task=00000000c60564d5] stackbd_io_fn_remap: HIT.r.2
kernel: stackbd [task=0000000089abc07d] stackbd_end_io_read_cloned: HIT.1
kernel: stackbd [task=0000000089abc07d] stackbd_end_io_read_cloned: HIT.2: obio.size=16384; bio.size=0
kernel: stackbd [task=0000000089abc07d] stackbd_end_io_read_cloned: HIT.3
kernel: stackbd [task=0000000089abc07d] stackbd_end_io_read_cloned: HIT.4
kernel: stackbd [task=00000000c60564d5] stackbd_io_fn_remap: HIT.r.1
kernel: debugbd [task=00000000c60564d5] debugbd_submit_bio: debugbd: make request read  block 32           #pages 0    total-size 32768     
kernel: stackbd [task=00000000c60564d5] stackbd_io_fn_remap: HIT.r.2
kernel: stackbd [task=0000000089abc07d] stackbd_end_io_read_cloned: HIT.1
kernel: stackbd [task=0000000089abc07d] stackbd_end_io_read_cloned: HIT.2: obio.size=32768; bio.size=0
kernel: stackbd [task=0000000089abc07d] stackbd_end_io_read_cloned: HIT.3

debugbd is the same driver, but displays information about requests for debugging.

stackbd driver source code:

#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/init.h>

#include <linux/version.h>
#include <linux/kernel.h> // printk()
#include <linux/fs.h>     // everything...
#include <linux/errno.h>  // error codes
#include <linux/types.h>  // size_t
#include <linux/vmalloc.h>
#include <linux/genhd.h>
#include <linux/blkdev.h>
#include <linux/hdreg.h>
#include <linux/kthread.h>

#include <trace/events/block.h>

#include "logging.h"
#include "../common/stackbd.h"

#define STACKBD_BDEV_MODE (FMODE_READ | FMODE_WRITE | FMODE_EXCL)

#define KERNEL_SECTOR_SHIFT 9
#define KERNEL_SECTOR_SIZE (1 << KERNEL_SECTOR_SHIFT)

#define DECLARE_BIO_VEC struct bio_vec
#define ACCESS_BIO_VEC(x) (x)
#define DECLARE_BVEC_ITER struct bvec_iter
#define BIO_SET_SECTOR(bio, sec) (bio)->bi_iter.bi_sector = (sec)
#define BIO_GET_SECTOR(bio) (bio)->bi_iter.bi_sector
#define BIO_GET_SIZE(bio) (bio)->bi_iter.bi_size
#define BIO_SET_BDEV(bio, bdev) bio_set_dev((bio), (bdev));

//#ifdef CONFIG_LBDAF
#define SEC_FMT "llu"
//#else
//#define SEC_FMT "lu"
//#endif

MODULE_LICENSE("Dual BSD/GPL");

static int major_num = 0;
module_param(major_num, int, 0);
static int LOGICAL_BLOCK_SIZE = 512;
module_param(LOGICAL_BLOCK_SIZE, int, 0);
static bool is_remap = false;
module_param(is_remap, bool, 0);

typedef struct
{
	char path[PATH_MAX];
    fmode_t mode;
	bool is_bdev_raw_ok;
	struct block_device *bdev_raw;
} stackbd_target_t;

/*
 * The internal representation of our device.
 */
static struct stackbd_t {
    sector_t capacity; /* Sectors */
    struct gendisk *gd;
    spinlock_t lock;
    struct bio_list bio_list;
    struct task_struct *thread;
    int is_active;
    stackbd_target_t tgt;
    /* Our request queue */
    struct request_queue *queue;
} stackbd;

static DECLARE_WAIT_QUEUE_HEAD(req_event);

typedef void (* t_stackbd_io_fn)(struct bio *);
static t_stackbd_io_fn p_stackbd_io_fn = NULL;
static struct bio_set bs;

int buffer_read(
	struct stackbd_t *dev,
    unsigned long sector,
    unsigned long nsect,
    char *buffer
)
{
    int result = 0;
    unsigned nsize = nsect << KERNEL_SECTOR_SHIFT;
    int npages = ((nsize - 1) >> PAGE_SHIFT) + 1;
    struct bio *bio;
    struct block_device *bdev = dev->tgt.bdev_raw;

    //PINFO("begin; sector=%ld; nsect=%ld; buffer=%p\n", sector, nsect, buffer);

    if(unlikely(!dev->tgt.is_bdev_raw_ok))
    {
        PERROR("bdev is NULL!\n");
        result = -EFAULT;
        goto out;
    }

    bio = bio_alloc(GFP_NOIO, npages);

    if(unlikely(!bio))
    {
        PERROR("bio_alloc failed!\n");
        result = -ENOMEM;
        goto out;
    }

    BIO_SET_BDEV(bio, bdev);
    BIO_SET_SECTOR(bio, sector);

    bio_set_op_attrs(bio, REQ_OP_READ, REQ_PREFLUSH);

    {
        char *ptr = buffer;
        do
        {
            struct page *page;
            page = virt_to_page(ptr);
            if(unlikely(!page))
            {
                PERROR("virt_to_page failed!\n");
                result = -ENOMEM;
                break;
            }

            {
                unsigned op = offset_in_page(ptr);
                unsigned this_step = min((unsigned)(PAGE_SIZE - op), nsize);
                bio_add_page(bio, page, this_step, op);
                nsize -= this_step;
                ptr += this_step;
            }
        } while(nsize > 0);

        if(likely(!result))
        {
            result = submit_bio_wait(bio);
        }
        bio_put(bio);
    }
out:
    //PINFO("end (%d)\n", result);
    return result;
}

int buffer_write(
    struct stackbd_t *dev,
    unsigned long sector,
    unsigned long nsect,
    char *buffer
)
{
    int result = 0;
    unsigned nsize = nsect << KERNEL_SECTOR_SHIFT;
    int npages = ((nsize - 1) >> PAGE_SHIFT) + 1;
    struct bio *bio;
    struct block_device *bdev = dev->tgt.bdev_raw;

    //PINFO("begin; sector=%ld; nsect=%ld; buffer=%p\n", sector, nsect, buffer);

    if(unlikely(!dev->tgt.is_bdev_raw_ok))
    {
        PERROR("bdev is NULL!\n");
        result = -EFAULT;
        goto out;
    }

    bio = bio_alloc(GFP_NOIO, npages);
    if(unlikely(!bio))
    {
        PERROR("bio_alloc failed!\n");
        result = -ENOMEM;
        goto out;
    }
    BIO_SET_BDEV(bio, bdev);
    BIO_SET_SECTOR(bio, sector);

    bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_PREFLUSH);

    {
        char *ptr = buffer;
        do
        {
            struct page *page = virt_to_page(ptr);

            if(unlikely(!page))
            {
                PERROR("alloc page failed!\n");
                result = -ENOMEM;
                break;
            }

            {
                unsigned op = offset_in_page(ptr);
                unsigned this_step = min((unsigned)(PAGE_SIZE - op), nsize);
                bio_add_page(bio, page, this_step, op);
                nsize -= this_step;
                ptr += this_step;
            }
        } while(nsize > 0);

        if(likely(!result))
        {
	#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 8, 0)
            result = submit_bio_wait(bio);
	#else
            result = submit_bio_wait(WRITE | REQ_FLUSH, bio);
	#endif
        }
        bio_put(bio);
    }
out:
    //PINFO("end (%d)\n", result);
    return result;
}

static void stackbd_end_io_read_cloned(struct bio *bio)
{
    struct bio *obio = bio->bi_private;
    PINFO("HIT.1");
    if (bio_data_dir(bio) == READ)
    {
        DECLARE_BIO_VEC bvec;
        DECLARE_BVEC_ITER iter;

        PINFO("HIT.2: obio.size=%u; bio.size=%u", BIO_GET_SIZE(obio), BIO_GET_SIZE(bio));

        bio_for_each_segment(bvec, bio, iter)
        {
            char *p = page_address(ACCESS_BIO_VEC(bvec).bv_page) + ACCESS_BIO_VEC(bvec).bv_offset;
            int len = ACCESS_BIO_VEC(bvec).bv_len;
            int i;

            print_hex_dump(KERN_INFO, "readed data (1-st 16 bytes) ", DUMP_PREFIX_OFFSET, 16, 1, p, 16, false);

            for(i = 0; i < len; i++)
            {
                //*p++ ^= 0x12345678;
                *p++ ^= 0x88;
            }

            //p += len;
        }
        PINFO("HIT.3");
        bio_put(bio);
        bio_endio(obio);
    }
    else
    {
        bio_put(bio);
        bio_endio(obio);
    }
    //bio_put(bio);
    PINFO("HIT.4");
}

static void stackbd_io_fn_remap(struct bio *bio)
{
    DECLARE_BIO_VEC bvec;
    DECLARE_BVEC_ITER iter;
    struct bio *cbio = bio_clone_fast(bio, GFP_NOIO, &bs);

    BIO_SET_BDEV(cbio, stackbd.tgt.bdev_raw);
    cbio->bi_end_io = stackbd_end_io_read_cloned;
    cbio->bi_private = bio;
    //submit_bio_noacct(cbio);

    //trace_block_bio_remap(/*bdev_get_queue(stackbd.bdev_raw), */bio,
    //    stackbd.tgt.bdev_raw->bd_dev, BIO_GET_SECTOR(bio));

    if (bio_data_dir(bio) == READ)
    {
        PINFO("HIT.r.1");
        submit_bio_noacct(cbio);
        PINFO("HIT.r.2");
    }
    else
    {
        PINFO("HIT.w.1");
        bio_for_each_segment(bvec, cbio, iter)
        {
            char *p = page_address(ACCESS_BIO_VEC(bvec).bv_page) + ACCESS_BIO_VEC(bvec).bv_offset;
            int len = ACCESS_BIO_VEC(bvec).bv_len;
            int i;

            for(i = 0; i < len; i++)
            {
                // *p++ ^= 0x12345678;
                *p++ ^= 0x88;
            }

            print_hex_dump(KERN_INFO, "writed data (1-st 16 bytes) ", DUMP_PREFIX_OFFSET, 16, 1, p, 16, false);

            //p += len;
        }
        PINFO("HIT.w.2");
        submit_bio_noacct(cbio);
        PINFO("HIT.w.3");
    }
}

static void my_bio_complete(struct bio *bio, int ret)
{
    if (ret)
        bio_io_error(bio);
    else
        bio_endio(bio);
}

static void stackbd_io_fn_clone(struct bio *bio)
{
    int res;
    DECLARE_BIO_VEC bvec;
    DECLARE_BVEC_ITER iter;
    sector_t sector = BIO_GET_SECTOR(bio);
    int size = BIO_GET_SIZE(bio);
    int nsect = size >> KERNEL_SECTOR_SHIFT;
    char *src, *p;

    do
    {
		if (bio_data_dir(bio) == READ)
		{
			p = src = kmalloc(size, GFP_KERNEL);
			if (!src)
			{
				PERROR("Unable to allocate read buffer!\n");
				res = -ENOMEM;
				break;
			}

			do
			{
				res = buffer_read(&stackbd, sector, nsect, src);
				if (unlikely(res))
				{
					PERROR("i/o error while read!\n");
					break;
				}

				bio_for_each_segment(bvec, bio, iter)
				{
					char *dst = page_address(ACCESS_BIO_VEC(bvec).bv_page) + ACCESS_BIO_VEC(bvec).bv_offset;
					int len = ACCESS_BIO_VEC(bvec).bv_len;
					memcpy(dst, p, len);
					p += len;
				}
			}
			while (0);
		}
		else
		{
			p = src = kmalloc(size, GFP_KERNEL);
			if (!src)
			{
				PERROR("Unable to allocate write buffer!\n");
				res = -ENOMEM;
				break;
			}

			bio_for_each_segment(bvec, bio, iter)
			{
				char *dst = page_address(ACCESS_BIO_VEC(bvec).bv_page) + ACCESS_BIO_VEC(bvec).bv_offset;
				int len = ACCESS_BIO_VEC(bvec).bv_len;
				memcpy(p, dst, len);
				p += len;
			}
			res = buffer_write(&stackbd, sector, nsect, src);
			if (unlikely(res))
			{
				PERROR("i/o error while write!\n");
			}
		}
		kfree(src);
    }
    while (0);

    my_bio_complete(bio, res);
} // stackbd_io_fn_clone

static int stackbd_threadfn(void *data)
{
    struct bio *bio;

    set_user_nice(current, -20);

    while (!kthread_should_stop())
    {
        /* wake_up() is after adding bio to list. No need for condition */ 
        wait_event_interruptible(req_event, kthread_should_stop() ||
                !bio_list_empty(&stackbd.bio_list));

        spin_lock_irq(&stackbd.lock);
        if (bio_list_empty(&stackbd.bio_list))
        {
            spin_unlock_irq(&stackbd.lock);
            continue;
        }

        bio = bio_list_pop(&stackbd.bio_list);
        spin_unlock_irq(&stackbd.lock);

        p_stackbd_io_fn(bio);
    }

    return 0;
}

// Handle an I/O request.
static blk_qc_t stackbd_submit_bio(struct bio *bio)
{
    /*PINFO("stackbd: make request %-5s block %-12" SEC_FMT " #pages %-4hu total-size %-10u\n",
        bio_data_dir(bio) == WRITE ? "write" : "read",
        BIO_GET_SECTOR(bio),
        bio->bi_vcnt,
        BIO_GET_SIZE(bio)
    );*/

    spin_lock_irq(&stackbd.lock);
    if (!stackbd.tgt.bdev_raw)
    {
        PERROR("Request before bdev_raw is ready, aborting\n");
        goto abort;
    }
    if (!stackbd.is_active)
    {
        PERROR("Device not active yet, aborting\n");
        goto abort;
    }
    bio_list_add(&stackbd.bio_list, bio);
    wake_up(&req_event);
    spin_unlock_irq(&stackbd.lock);

    goto exit;

abort:
    spin_unlock_irq(&stackbd.lock);
    PERROR("<%p> Abort request\n", bio);
    bio_io_error(bio);
exit:
    return BLK_QC_T_NONE;
}

static int stackbd_target_open(stackbd_target_t *p_tdev)
{
    int res = 0;
    char *path = p_tdev->path;

    PINFO("Open %s\n", path);
    {
        struct block_device *bdev_raw = blkdev_get_by_path(path, p_tdev->mode, p_tdev);
        p_tdev->bdev_raw = bdev_raw;

        if (unlikely(IS_ERR(bdev_raw)))
        {
            res = PTR_ERR(bdev_raw);
            PINFO("error opening raw device %s <%d>\n", path, res);
        }

        p_tdev->is_bdev_raw_ok = !res;
        return res;
    }
}

static void stackbd_target_close(stackbd_target_t *p_tdev)
{
    if (p_tdev->is_bdev_raw_ok)
    {
        blkdev_put(p_tdev->bdev_raw, p_tdev->mode);
        p_tdev->bdev_raw = NULL;
        p_tdev->is_bdev_raw_ok = false;
    }
}

static int stackbd_start(char dev_path[])
{
    unsigned max_sectors;
    sector_t lba;

    stackbd_target_t *p_tgt = &stackbd.tgt;
    strcpy(p_tgt->path, dev_path);
    p_tgt->mode = STACKBD_BDEV_MODE;

    if(stackbd_target_open(p_tgt) < 0)
    {
        PERROR("Error while stackbd_target_open(..)!");
        return -EFAULT;
    }

    /* Set up our internal device */
    lba = i_size_read(p_tgt->bdev_raw->bd_inode) >> KERNEL_SECTOR_SHIFT;

    stackbd.capacity = lba;//get_capacity(stackbd.bdev_raw->bd_disk);
    PINFO("Device real capacity: %" SEC_FMT "\n", stackbd.capacity);

    set_capacity(stackbd.gd, stackbd.capacity);

    max_sectors = queue_max_hw_sectors(bdev_get_queue(p_tgt->bdev_raw));
    blk_queue_max_hw_sectors(stackbd.queue, max_sectors);
    PINFO("Max sectors: %u\n", max_sectors);

    stackbd.thread = kthread_create(stackbd_threadfn, NULL,
           stackbd.gd->disk_name);
    if (IS_ERR(stackbd.thread))
    {
        PERROR("error kthread_create <%lu>\n", PTR_ERR(stackbd.thread));
        goto error_after_bdev;
    }

    PINFO("done initializing successfully\n");
    stackbd.is_active = 1;
    wake_up_process(stackbd.thread);

    return 0;

error_after_bdev:
    stackbd_target_close(p_tgt);

    return -EFAULT;
}

static int stackbd_ioctl(struct block_device *bdev, fmode_t mode,
		     unsigned int cmd, unsigned long arg)
{
    char dev_path[80];
    void __user *argp = (void __user *)arg;

    switch (cmd)
    {
    case STACKBD_DO_IT:
        PINFO("\n*** DO IT!!!!!!! ***\n\n");

        if (copy_from_user(dev_path, argp, sizeof(dev_path)))
            return -EFAULT;

        return stackbd_start(dev_path);
    default:
        return -ENOTTY;
    }
}

/*
 * The HDIO_GETGEO ioctl is handled in blkdev_ioctl(), which
 * calls this. We need to implement getgeo, since we can't
 * use tools such as fdisk to partition the drive otherwise.
 */
int stackbd_getgeo(struct block_device * block_device, struct hd_geometry * geo)
{
	long size;

	/* We have no real geometry, of course, so make something up. */
	size = stackbd.capacity * (LOGICAL_BLOCK_SIZE / KERNEL_SECTOR_SIZE);
	geo->cylinders = (size & ~0x3f) >> 6;
	geo->heads = 4;
	geo->sectors = 16;
	geo->start = 0;
	return 0;
}

/*
 * The device operations structure.
 */
static struct block_device_operations stackbd_ops = {
    .owner  = THIS_MODULE,
    .submit_bio = stackbd_submit_bio,
    .getgeo = stackbd_getgeo,
    .ioctl  = stackbd_ioctl,
};

static int __init stackbd_init(void)
{
    PINFO("is_remap=%d\n", is_remap);

    if (is_remap)
    {
        p_stackbd_io_fn = stackbd_io_fn_remap;
    }
    else
    {
        p_stackbd_io_fn = stackbd_io_fn_clone;
    }

    /* Set up our internal device */
    spin_lock_init(&stackbd.lock);

    /* Get registered */
    if ((major_num = register_blkdev(major_num, STACKBD_NAME)) < 0)
    {
        PERROR("unable to get major number\n");
        goto error_after_alloc_queue;
    }

    /* Gendisk structure */
    if (!(stackbd.gd = blk_alloc_disk(NUMA_NO_NODE)))
    {
        PERROR("unable to alloc disk\n");
        goto error_after_register_blkdev;
    }

    stackbd.gd->major = major_num;
    stackbd.gd->first_minor = 0;
    stackbd.gd->minors = 1 << 4; 
    stackbd.gd->fops = &stackbd_ops;
    stackbd.gd->private_data = &stackbd;
    strcpy(stackbd.gd->disk_name, STACKBD_NAME_0);
    stackbd.queue = stackbd.gd->queue;

    if(bioset_init(&bs, 64, 0, BIOSET_NEED_BVECS) < 0)
    //if(bioset_init(&bs, BIO_POOL_SIZE, 0, 0) < 0)
    {
        PERROR( "Cannot allocate bioset");
        goto error_after_register_blkdev;
    }

    if(add_disk(stackbd.gd) < 0)
    {
        PERROR("unable to add disk\n");
        goto error_after_register_blkdev;
    }

    PINFO("init done\n");

    return 0;

error_after_register_blkdev:
    unregister_blkdev(major_num, STACKBD_NAME);
error_after_alloc_queue:
    blk_cleanup_queue(stackbd.queue);

    return -EFAULT;
}

static void __exit stackbd_exit(void)
{
    PINFO("exit\n");

    if (stackbd.is_active)
    {
        kthread_stop(stackbd.thread);
        stackbd_target_close(&stackbd.tgt);
    }

    del_gendisk(stackbd.gd);
    put_disk(stackbd.gd);
    bioset_exit(&bs);
    unregister_blkdev(major_num, STACKBD_NAME);
    blk_cleanup_queue(stackbd.queue);
}

module_init(stackbd_init);
module_exit(stackbd_exit);

https://github.com/zenbooster/stackbd/blob/5.15.0-70-generic/module/main.c
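The "obio.size=16384; bio.size=0" lines in the log are consistent with the clone's iterator having been advanced to the end by the time its completion callback runs, in which case bio_for_each_segment on the clone has nothing left to visit and the XOR loop never executes. A common way around this is to save a copy of the iterator before submitting and walk the segments from that saved copy at completion. The toy userspace model below illustrates the idea only: the toy_* types are simplified stand-ins for bio_vec, bvec_iter and bio, they are not the kernel definitions, and whether this alone fixes the driver is untested.

```c
#include <assert.h>
#include <stddef.h>

/* Toy counterparts of bio_vec, bvec_iter and bio. */
struct toy_vec  { unsigned char *data; size_t len; };
struct toy_iter { size_t idx; size_t size; };  /* next segment, bytes left */
struct toy_bio  { struct toy_vec *vec; struct toy_iter iter; };

/* Models the lower layer finishing the I/O: the live iterator is consumed. */
void toy_complete(struct toy_bio *b)
{
    while (b->iter.size > 0) {
        assert(b->vec[b->iter.idx].len <= b->iter.size);
        b->iter.size -= b->vec[b->iter.idx].len;
        b->iter.idx++;
    }
}

/*
 * XOR every byte covered by `it`, which is passed by value so the
 * caller's copy is left untouched. Returns the number of bytes changed.
 */
size_t toy_xor(struct toy_bio *b, struct toy_iter it, unsigned char key)
{
    size_t done = 0;

    while (it.size > 0) {
        struct toy_vec *v = &b->vec[it.idx++];

        assert(v->len <= it.size);
        for (size_t i = 0; i < v->len; i++)
            v->data[i] ^= key;
        it.size -= v->len;
        done += v->len;
    }
    return done;
}
```

After toy_complete(), calling toy_xor() with the bio's live iterator touches zero bytes, while calling it with a copy taken before completion touches all of them. In the driver, the equivalent would be to keep a struct bvec_iter copy taken before submit_bio_noacct() (for example in a small per-request structure hung off bi_private) and to iterate from it in the end_io handler.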

1 Comment
2024/06/09
17:55 UTC

1

How to debug KVM hypervisor text in gdb (arm64)?

In the nVHE KVM model, there is a stub running in EL2 that provides services to the host kernel to implement KVM (e.g. guest context switching and setting up certain EL2 system registers).

But since EL2 only has one TTBR register (TTBR0_EL2) and the host kernel is running in high memory (TTBR1_EL1), a relocation happens at run time that maps all EL2-specific code at an offset that TTBR0_EL2 can work with.

But GDB doesn't know about this, since it only looks at the static vmlinux file. Because of this, I cannot set a breakpoint in the hypervisor code: the addresses are wrong (relocated).

How do I get around this?

0 Comments
2024/06/02
05:12 UTC

2

Is it possible to create page tables given a list of virtual addresses?

I am trying to create a software model of hierarchical/multilevel paging.

I am currently trying to create these multilevel page tables using a list of virtual addresses. How do I go about doing this?
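A minimal sketch of such a software model follows. It assumes an x86-64 style layout (4 levels, 9 index bits per level, 4 KiB pages) and simply hands out frame numbers 0, 1, 2, ... in list order; both choices are assumptions made for the example, and all names are invented. Each virtual address is split into per-level indices, tables are allocated on demand while walking down, and the last level stores the frame.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define LEVELS   4            /* assumption: 4-level layout */
#define BITS     9            /* 512 entries per table */
#define ENTRIES  (1u << BITS)
#define PG_SHIFT 12           /* 4 KiB pages */
#define PRESENT  1ull         /* bit 0 marks a valid entry */

/* Interior entries hold a child table pointer, leaf entries a frame. */
struct table { uint64_t entry[ENTRIES]; };

/* Index into the table at `level` (0 = top level) for address `va`. */
unsigned va_index(uint64_t va, int level)
{
    assert(level >= 0 && level < LEVELS);
    int shift = PG_SHIFT + BITS * (LEVELS - 1 - level);
    return (unsigned)((va >> shift) & (ENTRIES - 1));
}

/* Map the page containing `va` to `frame`; returns 0, or -1 if out of memory. */
int map_page(struct table *root, uint64_t va, uint64_t frame)
{
    struct table *t = root;

    for (int level = 0; level < LEVELS - 1; level++) {
        uint64_t *e = &t->entry[va_index(va, level)];

        if (!(*e & PRESENT)) {
            struct table *child = calloc(1, sizeof(*child));
            if (!child)
                return -1;
            *e = (uint64_t)(uintptr_t)child | PRESENT;
        }
        t = (struct table *)(uintptr_t)(*e & ~PRESENT);
    }
    t->entry[va_index(va, LEVELS - 1)] = (frame << PG_SHIFT) | PRESENT;
    return 0;
}

/* Walk the tables; stores the physical address and returns 0, or -1 if unmapped. */
int translate(const struct table *root, uint64_t va, uint64_t *pa)
{
    const struct table *t = root;

    for (int level = 0; level < LEVELS - 1; level++) {
        uint64_t e = t->entry[va_index(va, level)];
        if (!(e & PRESENT))
            return -1;
        t = (const struct table *)(uintptr_t)(e & ~PRESENT);
    }
    uint64_t leaf = t->entry[va_index(va, LEVELS - 1)];
    if (!(leaf & PRESENT))
        return -1;
    *pa = (leaf & ~PRESENT) | (va & ((1ull << PG_SHIFT) - 1));
    return 0;
}

/* Build tables for `n` addresses, giving the i-th address frame number i. */
struct table *build_from_list(const uint64_t *vas, size_t n)
{
    struct table *root = calloc(1, sizeof(*root));

    if (!root)
        return NULL;
    for (size_t i = 0; i < n; i++)
        if (map_page(root, vas[i], i) != 0)
            return NULL;  /* sketch only: partially built tables are leaked */
    return root;
}
```

Addresses that share upper-level indices automatically share the intermediate tables, which is the space saving that hierarchical paging is meant to provide. A fuller model would add a frame allocator, permission bits, and teardown.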

15 Comments
2024/05/31
01:43 UTC

1

Where does LTS end and Stable begin?

The front page of kernel.org has the following listed:

mainline: | 6.10-rc1 | 2024-05-26
stable: | 6.9.2 | 2024-05-25
stable: | 6.8.11 | 2024-05-25
longterm: | 6.6.32 | 2024-05-25
longterm: | 6.1.92 | 2024-05-25

Does this mean that 6.6.32 is the latest LTS release or that the version just before 6.8.11, which would be 6.8.10, is the latest LTS release?

5 Comments
2024/05/30
04:10 UTC

4

How to implement a pseudo-bus backed by PCIe as a Linux kernel driver?

EDIT: I was able to achieve what I wanted using a multi-function device, establishing an IRQ domain and allocating and populating an array of struct mfd_cell at parent probe-time by walking the children devicetree nodes, and passing them to devm_mfd_add_devices.


I am making a Linux kernel driver to manage a PCIe connection between a Linux-based root complex and an FPGA-based endpoint. The endpoint exposes memory mapped resources of the FPGA (IP control blocks, video buffers, etc.) on multiple BARs:

[Figure: PCIe address memory map, corresponding to the first device tree fragment below]

I want this driver to act like a bus, so existing MMIO drivers can "Just Work" using the reg property of a devicetree to find their resources, encoded as <BAR offset size>. There are an unknown number of devices, defined only by the device tree:

my-ep-bus {
    compatible = "my-ep-bus";
    #address-cells = <2>;
    #size-cells = <1>;
    reg = <0x42000000 0 0x00006400 0x10000000 0 512>,
          /.../;

    mmio@1,40 {
        compatible = "existing-mmio-driver";
        reg = <1 0x40 0x18>;
        #address-cells = <2>;
        #size-cells = <1>;
    };

    mmio@1,80 {
        compatible = "existing-mmio-driver";
        reg = <1 0x80 0x18>;
        #address-cells = <2>;
        #size-cells = <1>;
    };

    fbuf@2,0 {
        compatible = "fb-driver";
        reg = <2 0 0x10000>;
        // ...
    };
};

Device Tree Usage states:

Since each parent node defines the addressing domain for its children, the address mapping can be chosen to best describe the system.

...
Nodes that are not direct children of the root do not use the CPU's address domain. In order to get a memory mapped address the device tree must specify how to translate addresses from one domain to another. The ranges property is used for this purpose

In their example, they use a very similar hierarchy for the address:

external-bus {
    #address-cells = <2>;
    #size-cells = <1>;
    ranges = <0 0  0x10100000   0x10000     // Chipselect 1, Ethernet
              1 0  0x10160000   0x10000     // Chipselect 2, i2c controller
              2 0  0x30000000   0x1000000>; // Chipselect 3, NOR Flash

    ethernet@0,0 {
        compatible = "smc,smc91c111";
        reg = <0 0 0x1000>;
    };

    i2c@1,0 {
        compatible = "acme,a1234-i2c-bus";
        #address-cells = <1>;
        #size-cells = <0>;
        reg = <1 0 0x1000>;
        rtc@58 {
            compatible = "maxim,ds1338";
            reg = <58>;
        };
    };

    flash@2,0 {
        compatible = "samsung,k8f1315ebm", "cfi-flash";
        reg = <2 0 0x4000000>;
    };
};

My question is: How is this actually implemented in C code? I looked through a bunch of sources for the various buses in the kernel, but the only thing I saw that seemed close was the way the PCI subsystem implements its own address translation scheme with OF, which seemed like it might require a patch to implement the same way for me?

It seems I want to implement a new struct bus_type, but I haven't been able to figure out how, or find examples of how, to perform the correct address translation so that when children of the bus use reg, they get their resources correctly.
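Whatever kernel mechanism ends up carrying it, the translation itself is small: a child's <BAR offset size> tuple becomes "base of that BAR plus offset", with a bounds check against the BAR length. The userspace sketch below shows only that arithmetic. The struct and function names are hypothetical, and in a real driver the BAR base and length would be discovered at probe time (for instance with pci_resource_start() and pci_resource_len()) since they are not known when the devicetree is written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one BAR as the parent learns it at probe time. */
struct bar_window {
    uint64_t base;  /* CPU physical address where the BAR is mapped */
    uint64_t size;  /* length of the BAR in bytes */
};

/* A child's reg entry in the post's <BAR offset size> encoding. */
struct child_reg {
    uint32_t bar;
    uint32_t offset;
    uint32_t size;
};

/*
 * Translate a child reg entry into a CPU physical start address.
 * Returns 0 on success, -1 if the BAR index or the range is out of bounds.
 */
int translate_child_reg(const struct bar_window *bars, size_t nbars,
                        struct child_reg reg, uint64_t *start)
{
    assert(bars != NULL && start != NULL);

    if (reg.bar >= nbars)
        return -1;

    const struct bar_window *w = &bars[reg.bar];

    /* Widen before adding so offset + size cannot wrap in 32 bits. */
    if ((uint64_t)reg.offset + reg.size > w->size)
        return -1;

    *start = w->base + reg.offset;
    return 0;
}
```

For the mmio@1,40 node above, reg = <1 0x40 0x18> would resolve to the base of BAR 1 plus 0x40. The result is what would be handed to each child as its memory resource, so the child driver never sees that it sits behind a PCIe endpoint.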

Any ideas? I'm open to use a different architecture if I'm barking up the wrong tree. It is important that the children devices of the EP device don't know that they are on a PCIe endpoint, just "here's your memory go nuts". Any pointers to resources would be the most helpful.

If you made it to the end, thank you <3

2 Comments
2024/05/30
00:22 UTC

30

What was your "linux kernel developer" journey like?

Coming from a microcontroller background, there are pretty good roadmaps to become a microcontroller-based products developer, aka embedded software/hardware engineer. It basically goes like this: You take a microcontroller, learn its architecture, understand its peripherals. Then you learn to program it in assembly and then in C/C++. Make a couple of projects and there you are - job ready!!!

However, I feel lost when I try to get into Linux. There are just so many layers to this. You can work on so many different abstractions. I am not even sure if I am asking in the correct subreddit. I want to know how the people who maintain the kernel and its components got into writing/maintaining code for the kernel. There is just so so so much to learn.

How did you start and, more importantly, how did you make sure that whatever you were doing to learn the stuff was correct? What do I need to learn first, and where do I begin? I might sound naive, but I want to be one of those people who actively contribute to the kernel. And when I think about it, I feel that it's already well-established code, so what would I be able to contribute to it?

I started my career two years ago as an embedded software developer (C programming on microcontroller-based products), and during my first live project I added so many bugs, simply because the code base was around 5000 lines of code and I, being a beginner, did not have a good understanding of each of the modules. Also, I am highly average. What I wonder is: how do kernel developers make sure that every code change does not break the system?

Even though I do not have any understanding of the kernel, I have a deep appreciation of it and the people who make it possible. And this inspires me to become one of those people who work on the kernel. How can I be one?

Thanks a lot for reading.

11 Comments
2024/05/27
17:43 UTC

2

Can't have a tristate entry in my Kconfig

I tried to add a tristate Kconfig entry for my own project, but it doesn't seem to work. My Kconfig:

config NETWORK_MODULE
    tristate "enable network module"

You can see from the picture below, I can't set the value of it to M by pressing M on my keyboard:

https://preview.redd.it/6lr6ap397s2d1.png?width=1328&format=png&auto=webp&s=4b30e2237a4c2bb73c9db8112c7214e7c3203cd3

8 Comments
2024/05/26
14:32 UTC

17

How would you describe being a kernel engineer in a big company?

I'm a CS graduate, currently interviewing for a job as a kernel engineer in a large company you all know. I have very little knowledge or experience in the field, and I know there's a lot to be learned until I can be beneficial to them, but if they take me I guess it's their fault XD. Anyway, wanted to ask a few generic questions about the field -

  1. What is the main thing one does on this kind of job? If you do it, do you find it interesting/exciting?
  2. Would you say experience gained as a kernel engineer is valid for embedded or other software engineering fields? I want to have relevant knowledge in case I don't find myself liking it, even though so far my OS course in uni made me like the idea of it.
  3. How well does it usually pay compared to other SWE jobs?

If you have any other advice feel free to throw them in (:

8 Comments
2024/05/25
23:20 UTC

7

Finding Kernel Devs

Hi all, hopefully not against community policy, but I am working on a project that needs deep, deep Kernel Dev input. Core kernel IO, memory management, etc. It's not a user space thang. Where can I go to find the right skill sets?

8 Comments
2024/05/24
10:15 UTC

1

CPU Frequency Stability Issue

Background Information

During the CPU stress testing of the server in the environment with CentOS 7.9 and kernel version 5.15.13, it was found that the CPU frequency could not be maintained at a high frequency. Therefore, a CPU frequency stress test was conducted on the server. The following information provides a detailed description of the relevant test conditions. Please refer to it:

Test Environment

Different system versions + the same kernel version:

CentOS 7.9 + Kernel 5.15.13-1.el7

RedHat 9.1 + Kernel 5.15.13-1.el7

Test Plan 1

RHEL 9.1 system image + 5.15.13 kernel

Set BIOS system profile to performance mode

Run #cpupower idle-set -D 0

After several hours of observation, the CPU frequency can remain stable at a high frequency.

Test Plan 2

CentOS 7.9 system image + 5.15.13 kernel

Set BIOS system profile to performance mode

Run #cpupower idle-set -D 0

After several hours of observation, the CPU frequency cannot remain stable at a high frequency.

Test Plan 3

CentOS 7.9 system image + 6.8.9 kernel

Set BIOS system profile to performance mode

Run #cpupower idle-set -D 0

After several hours of observation, the CPU frequency can remain stable at a high frequency.

Test Result Questions

With the same kernel version, the system version RHEL 9.1 can keep the CPU frequency running at a high frequency, while the system version CentOS 7.9 cannot keep the CPU frequency stable. Does RHEL 9.1 have special settings for the CPU frequency? What are these settings?

The CPU frequency test was performed on the server with system version CentOS 7.9 + kernel version 6.8.9, and it can keep the CPU frequency stable at a high frequency. Does this indicate that the kernel 6.8.9 has made fixes or restrictions for CPU frequency stability? Where are these fixes or restrictions set?
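
One way to narrow down the difference between the two installs is to dump the cpufreq settings each one ends up with and compare them side by side. The sketch below is only an illustration: it reads the standard cpufreq sysfs attributes for one CPU, and the helper names (read_first_line, dump_cpufreq) are made up for this example. It assumes the kernel exposes /sys/devices/system/cpu/cpuN/cpufreq/, which depends on the cpufreq driver in use; missing attributes are reported as unavailable.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of a small text file (such as a sysfs attribute)
 * into buf and strip the trailing newline.
 * Returns 0 on success, -1 if the file cannot be opened or read.
 */
int read_first_line(const char *path, char *buf, size_t len)
{
	FILE *f;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		buf[0] = '\0';
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/* Print the cpufreq attributes of one CPU that are worth comparing. */
void dump_cpufreq(int cpu)
{
	static const char *const attrs[] = {
		"scaling_driver", "scaling_governor",
		"scaling_cur_freq", "scaling_min_freq", "scaling_max_freq",
	};
	char path[128], val[64];
	size_t i;

	for (i = 0; i < sizeof(attrs) / sizeof(attrs[0]); i++) {
		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/cpufreq/%s",
			 cpu, attrs[i]);
		if (read_first_line(path, val, sizeof(val)) == 0)
			printf("cpu%d %s: %s\n", cpu, attrs[i], val);
		else
			printf("cpu%d %s: <unavailable>\n", cpu, attrs[i]);
	}
}
```

Calling dump_cpufreq(0) on both the CentOS 7.9 and the RHEL 9.1 install, with the same kernel booted, would show whether the scaling driver, governor or frequency limits differ between the two userspaces.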

0 Comments
2024/05/22
03:12 UTC

3

bpf_probe_read_{kernel/user} backports not working with bcc

I'm trying to patch an Android 4.9 kernel to support the probe_read_{user,kernel} and probe_read_{user,kernel}_str helpers. For the backport I used another patch, one that adds the bpf_probe_read_str helper, as an example. I've patched the kernel to add the helpers, but when I run bpftrace --info the str helper shows up while the newly added ones don't.

I'm posting this here since I wonder if it's an issue with my kernel patch.

bpftrace output

System
  OS: Linux 4.9.337-g4fcceb75c5cd #1 SMP PREEMPT Sat May 18 17:26:12 EEST 2024
  Arch: aarch64

Build
  version: v0.19.1
  LLVM: 14.0.6
  unsafe probe: yes
  bfd: no
  libdw (DWARF support): no

libbpf: failed to find valid kernel BTF
Kernel helpers
  probe_read: yes
  probe_read_str: yes
  probe_read_user: no
  probe_read_user_str: no
  probe_read_kernel: no
  probe_read_kernel_str: no
  get_current_cgroup_id: no
  send_signal: no
  override_return: no
  get_boot_ns: no
  dpath: no
  skboutput: no
  get_tai_ns: no
  get_func_ip: no

Kernel features
  Instruction limit: -1
  Loop support: no
  btf: no
  module btf: no
  map batch: no
  uprobe refcount (depends on Build:bcc bpf_attach_uprobe refcount): no

Map types
  hash: yes
  percpu hash: yes
  array: yes
  percpu array: yes
  stack_trace: yes
  perf_event_array: yes
  ringbuf: no

Probe types
  kprobe: yes
  tracepoint: yes
  perf_event: yes
  kfunc: no
  kprobe_multi: no
  raw_tp_special: no
  iter: no

This is the current diff I'm working on

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 744b4763b80e..de94c13b7193 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -559,6 +559,43 @@ enum bpf_func_id {
    */
    BPF_FUNC_probe_read_user,
 
+   /**
+   * int bpf_probe_read_kernel(void *dst, int size, void *src)
+   *     Read a kernel pointer safely.
+   *     Return: 0 on success or negative error
+   */
+   BPF_FUNC_probe_read_kernel,
+
+	/**
+	 * int bpf_probe_read_user_str(void *dst, int size, const void *unsafe_ptr)
+	 *     Copy a NUL terminated string from user unsafe address. In case the string
+	 *     length is smaller than size, the target is not padded with further NUL
+	 *     bytes. In case the string length is larger than size, just count-1
+	 *     bytes are copied and the last byte is set to NUL.
+	 *     @dst: destination address
+	 *     @size: maximum number of bytes to copy, including the trailing NUL
+	 *     @unsafe_ptr: unsafe address
+	 *     Return:
+	 *       > 0 length of the string including the trailing NUL on success
+	 *       < 0 error
+	 */
+	BPF_FUNC_probe_read_user_str,
+
+	/**
+	 * int bpf_probe_read_kernel_str(void *dst, int size, const void *unsafe_ptr)
+	 *     Copy a NUL terminated string from unsafe address. In case the string
+	 *     length is smaller than size, the target is not padded with further NUL
+	 *     bytes. In case the string length is larger than size, just count-1
+	 *     bytes are copied and the last byte is set to NUL.
+	 *     @dst: destination address
+	 *     @size: maximum number of bytes to copy, including the trailing NUL
+	 *     @unsafe_ptr: unsafe address
+	 *     Return:
+	 *       > 0 length of the string including the trailing NUL on success
+	 *       < 0 error
+	 */
+	BPF_FUNC_probe_read_kernel_str,
+
 	__BPF_FUNC_MAX_ID,
 };
 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index a1e37a5d8c88..3478ca744a45 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -94,7 +94,7 @@ static const struct bpf_func_proto bpf_probe_read_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
-BPF_CALL_3(bpf_probe_read_user, void *, dst, u32, size, const void *, unsafe_ptr)
+BPF_CALL_3(bpf_probe_read_user, void *, dst, u32, size, const void  __user *, unsafe_ptr)
 {
 	int ret;
 
@@ -115,6 +115,27 @@ static const struct bpf_func_proto bpf_probe_read_user_proto = {
 };
 
 
+BPF_CALL_3(bpf_probe_read_kernel, void *, dst, u32, size, const void *, unsafe_ptr)
+{
+	int ret;
+
+	ret = probe_kernel_read(dst, unsafe_ptr, size);
+	if (unlikely(ret < 0))
+		memset(dst, 0, size);
+
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_probe_read_kernel_proto = {
+	.func		= bpf_probe_read_kernel,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_RAW_STACK,
+	.arg2_type	= ARG_CONST_STACK_SIZE,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+
 BPF_CALL_3(bpf_probe_write_user, void *, unsafe_ptr, const void *, src,
 	   u32, size)
 {
@@ -487,6 +508,69 @@ static const struct bpf_func_proto bpf_probe_read_str_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+
+
+BPF_CALL_3(bpf_probe_read_user_str, void *, dst, u32, size,
+	   const void __user *, unsafe_ptr)
+{
+	int ret;
+
+	/*
+	 * The strncpy_from_unsafe() call will likely not fill the entire
+	 * buffer, but that's okay in this circumstance as we're probing
+	 * arbitrary memory anyway similar to bpf_probe_read() and might
+	 * as well probe the stack. Thus, memory is explicitly cleared
+	 * only in error case, so that improper users ignoring return
+	 * code altogether don't copy garbage; otherwise length of string
+	 * is returned that can be used for bpf_perf_event_output() et al.
+	 */
+	ret = strncpy_from_unsafe_user(dst, unsafe_ptr, size);
+	if (unlikely(ret < 0))
+		memset(dst, 0, size);
+
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_probe_read_user_str_proto = {
+	.func		= bpf_probe_read_user_str,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_RAW_STACK,
+	.arg2_type	= ARG_CONST_STACK_SIZE,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+
+BPF_CALL_3(bpf_probe_read_kernel_str, void *, dst, u32, size,
+	   const void *, unsafe_ptr)
+{
+	int ret;
+
+	/*
+	 * The strncpy_from_unsafe() call will likely not fill the entire
+	 * buffer, but that's okay in this circumstance as we're probing
+	 * arbitrary memory anyway similar to bpf_probe_read() and might
+	 * as well probe the stack. Thus, memory is explicitly cleared
+	 * only in error case, so that improper users ignoring return
+	 * code altogether don't copy garbage; otherwise length of string
+	 * is returned that can be used for bpf_perf_event_output() et al.
+	 */
+	ret = strncpy_from_unsafe(dst, unsafe_ptr, size);
+	if (unlikely(ret < 0))
+		memset(dst, 0, size);
+
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_probe_read_kernel_str_proto = {
+	.func		= bpf_probe_read_kernel_str,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_RAW_STACK,
+	.arg2_type	= ARG_CONST_STACK_SIZE,
+	.arg3_type	= ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id)
 {
 	switch (func_id) {
@@ -500,8 +584,14 @@ static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id)
 		return &bpf_probe_read_proto;
 	case BPF_FUNC_probe_read_user:
 		return &bpf_probe_read_user_proto;
+	case BPF_FUNC_probe_read_kernel:
+		return &bpf_probe_read_kernel_proto;
 	case BPF_FUNC_probe_read_str:
 		return &bpf_probe_read_str_proto;
+	case BPF_FUNC_probe_read_user_str:
+		return &bpf_probe_read_user_str_proto;
+	case BPF_FUNC_probe_read_kernel_str:
+		return &bpf_probe_read_kernel_str_proto;
 	case BPF_FUNC_ktime_get_ns:
 		return &bpf_ktime_get_ns_proto;
 	case BPF_FUNC_tail_call:
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 155ce25c069d..91d5691288a7 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -522,7 +522,44 @@ enum bpf_func_id {
    *     Return: 0 on success or negative error
    */
    BPF_FUNC_probe_read_user,
+
+   /**
+   * int bpf_probe_read_kernel(void *dst, int size, void *src)
+   *     Read a kernel pointer safely.
+   *     Return: 0 on success or negative error
+   */
+   BPF_FUNC_probe_read_kernel,
 	
+	/**
+	 * int bpf_probe_read_user_str(void *dst, int size, const void *unsafe_ptr)
+	 *     Copy a NUL terminated string from user unsafe address. In case the string
+	 *     length is smaller than size, the target is not padded with further NUL
+	 *     bytes. In case the string length is larger than size, just count-1
+	 *     bytes are copied and the last byte is set to NUL.
+	 *     @dst: destination address
+	 *     @size: maximum number of bytes to copy, including the trailing NUL
+	 *     @unsafe_ptr: unsafe address
+	 *     Return:
+	 *       > 0 length of the string including the trailing NUL on success
+	 *       < 0 error
+	 */
+	BPF_FUNC_probe_read_user_str,
+
+	/**
+	 * int bpf_probe_read_kernel_str(void *dst, int size, const void *unsafe_ptr)
+	 *     Copy a NUL terminated string from unsafe address. In case the string
+	 *     length is smaller than size, the target is not padded with further NUL
+	 *     bytes. In case the string length is larger than size, just count-1
+	 *     bytes are copied and the last byte is set to NUL.
+	 *     @dst: destination address
+	 *     @size: maximum number of bytes to copy, including the trailing NUL
+	 *     @unsafe_ptr: unsafe address
+	 *     Return:
+	 *       > 0 length of the string including the trailing NUL on success
+	 *       < 0 error
+	 */
+	BPF_FUNC_probe_read_kernel_str,
+  
   __BPF_FUNC_MAX_ID,
 };

This is also a follow-up to the following patch, which adds probe_read_user and which, I now see, didn't work either:

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 67d7d771a944..744b4763b80e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -552,6 +552,13 @@ enum bpf_func_id {
 	 */
 	BPF_FUNC_get_socket_uid,
 
+   /**
+   * int bpf_probe_read_user(void *dst, int size, void *src)
+   *     Read a userspace pointer safely.
+   *     Return: 0 on success or negative error
+   */
+   BPF_FUNC_probe_read_user,
+
 	__BPF_FUNC_MAX_ID,
 };
 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 59182e6d6f51..a1e37a5d8c88 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -94,35 +94,27 @@ static const struct bpf_func_proto bpf_probe_read_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
-BPF_CALL_3(bpf_probe_read_str, void *, dst, u32, size, const void *, unsafe_ptr)
+BPF_CALL_3(bpf_probe_read_user, void *, dst, u32, size, const void *, unsafe_ptr)
 {
 	int ret;
 
-	/*
-	 * The strncpy_from_unsafe() call will likely not fill the entire
-	 * buffer, but that's okay in this circumstance as we're probing
-	 * arbitrary memory anyway similar to bpf_probe_read() and might
-	 * as well probe the stack. Thus, memory is explicitly cleared
-	 * only in error case, so that improper users ignoring return
-	 * code altogether don't copy garbage; otherwise length of string
-	 * is returned that can be used for bpf_perf_event_output() et al.
-	 */
-	ret = strncpy_from_unsafe(dst, unsafe_ptr, size);
+	ret = probe_user_read(dst, unsafe_ptr, size);
 	if (unlikely(ret < 0))
 		memset(dst, 0, size);
 
 	return ret;
 }
 
-static const struct bpf_func_proto bpf_probe_read_str_proto = {
-	.func           = bpf_probe_read_str,
-	.gpl_only       = true,
-	.ret_type       = RET_INTEGER,
+static const struct bpf_func_proto bpf_probe_read_user_proto = {
+	.func		= bpf_probe_read_user,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_RAW_STACK,
 	.arg2_type	= ARG_CONST_STACK_SIZE,
 	.arg3_type	= ARG_ANYTHING,
 };
 
+
 BPF_CALL_3(bpf_probe_write_user, void *, unsafe_ptr, const void *, src,
 	   u32, size)
 {
@@ -506,6 +498,8 @@ static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id)
 		return &bpf_map_delete_elem_proto;
 	case BPF_FUNC_probe_read:
 		return &bpf_probe_read_proto;
+	case BPF_FUNC_probe_read_user:
+		return &bpf_probe_read_user_proto;
 	case BPF_FUNC_probe_read_str:
 		return &bpf_probe_read_str_proto;
 	case BPF_FUNC_ktime_get_ns:
@@ -534,8 +528,6 @@ static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id)
 		return &bpf_current_task_under_cgroup_proto;
 	case BPF_FUNC_get_prandom_u32:
 		return &bpf_get_prandom_u32_proto;
-	case BPF_FUNC_probe_read_str:
-		return &bpf_probe_read_str_proto;
 	default:
 		return NULL;
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a339bea1f4c8..155ce25c069d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -516,7 +516,14 @@ enum bpf_func_id {
 	 */
 	BPF_FUNC_get_socket_uid,
 
-	__BPF_FUNC_MAX_ID,
+   /**
+   * int bpf_probe_read_user(void *dst, int size, void *src)
+   *     Read a userspace pointer safely.
+   *     Return: 0 on success or negative error
+   */
+   BPF_FUNC_probe_read_user,
+	
+  __BPF_FUNC_MAX_ID,
 };
 
 /* All flags used by eBPF helper functions, placed here. */
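
One thing that may be worth checking here (a guess based on the diffs, not something confirmed in the post) is the numeric helper ID. BPF programs call helpers by their position in enum bpf_func_id, and userspace tools are built against the mainline numbering. The toy program below is a sketch of that mismatch: the UPSTREAM_* values are the mainline IDs, while the BACKPORT_* enum and the backport_accepts() check are hypothetical stand-ins for the patched 4.9 tree, with 47 for get_socket_uid assumed to match mainline.

```c
#include <assert.h>
#include <stdio.h>

/* Numeric helper IDs as assigned in mainline include/uapi/linux/bpf.h. */
enum upstream_id {
	UPSTREAM_probe_read            = 4,
	UPSTREAM_probe_read_str        = 45,
	UPSTREAM_probe_read_user       = 112,
	UPSTREAM_probe_read_kernel     = 113,
	UPSTREAM_probe_read_user_str   = 114,
	UPSTREAM_probe_read_kernel_str = 115,
};

/*
 * Hypothetical model of the patched tree: the new helpers are appended
 * right after get_socket_uid, so they receive the next free numbers.
 * The value 47 is an assumption (it is the mainline ID of that helper).
 */
enum backport_id {
	BACKPORT_get_socket_uid = 47,
	BACKPORT_probe_read_user,       /* 48 */
	BACKPORT_probe_read_kernel,     /* 49 */
	BACKPORT_probe_read_user_str,   /* 50 */
	BACKPORT_probe_read_kernel_str, /* 51 */
	BACKPORT_MAX_ID,                /* 52 */
};

/* In this model every mainline ID of the new helpers is out of range. */
static_assert(BACKPORT_MAX_ID <= UPSTREAM_probe_read_user,
	      "mainline IDs of the new helpers exceed the backport's range");

/* Mimics a range check on the helper ID, as a kernel would apply at load. */
int backport_accepts(int func_id)
{
	return func_id > 0 && func_id < BACKPORT_MAX_ID;
}

void show_mismatch(void)
{
	printf("probe_read_user: backport ID %d, mainline ID %d\n",
	       BACKPORT_probe_read_user, UPSTREAM_probe_read_user);
	printf("mainline ID accepted by backport: %s\n",
	       backport_accepts(UPSTREAM_probe_read_user) ? "yes" : "no");
}
```

Under these assumptions a tool probing for helper 112 is rejected, while probe_read_str (ID 45 in both numberings) is found, which would match the bpftrace output above.
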
2 Comments
2024/05/19
12:43 UTC

2

I encountered this problem when using the kernel

I tried to hook system calls from a kernel module, following https://www.cnblogs.com/lanrenxinxin/p/6289436.html. The author mentions that the kernel enforces memory protections that keep this technique from working. Specifically, the stock Lollipop and Marshmallow kernels are built with the CONFIG_STRICT_MEMORY_RWX option enabled.

The kernel I used is https://github.com/LowTension/BAALAM_android_kernel_xiaomi_sm8250

I did not find CONFIG_STRICT_MEMORY_RWX in my kernel's configuration file, I should solve the problem I e

[  126.609564] hello world!
[  126.669254] Unable to handle kernel write to read-only memory at virtual address ffffffa468c009a8
[  126.669260] Mem abort info:
[  126.669263]   ESR = 0x9600004e
[  126.669268]   Exception class = DABT (current EL), IL = 32 bits
[  126.669271]   SET = 0, FnV = 0
[  126.669273]   EA = 0, S1PTW = 0
[  126.669276] Data abort info:
[  126.669278]   ISV = 0, ISS = 0x0000004e
[  126.669281]   CM = 0, WnR = 1
[  126.669285] swapper pgtable: 4k pages, 39-bit VAs, pgdp = 00000000b75a968c
[  126.669288] [ffffffa468c009a8] pgd=000000027fffe003, pud=000000027fffe003, pmd=00600000a1a00791
[  126.669297] Internal error: Oops: 9600004e [#1] PREEMPT SMP
[  126.669302] Modules linked in: krhook(FO+) sla(FO)
[  126.669308] Process insmod (pid: 10171, stack limit = 0x000000002907ea0c)
[  126.669313] CPU: 6 PID: 10171 Comm: insmod Tainted: GFS      W  O      4.19.303-Puls #4
[  126.669317] Hardware name: Qualcomm Technologies, Inc. xiaomi umi (DT)
[  126.669321] pstate: 60400005 (nZCv daif +PAN -UAO)
[  126.669328] pc : syscall_hook_init+0x108/0x160 [krhook]
[  126.669333] lr : syscall_hook_init+0xe8/0x160 [krhook]
[  126.669336] sp : ffffff802c52bb20
[  126.669338] x29: ffffff802c52bb20 x28: 0000000000000000 
[  126.669342] x27: ffffff8011db6438 x26: 0000000000000023 
[  126.669345] x25: 0000000000000160 x24: ffffffa469907000 
[  126.669348] x23: ffffffa452695000 x22: ffffffa452695000 
[  126.669351] x21: ffffffc5abd05a00 x20: ffffffa452695000 
[  126.669354] x19: ffffffa452695000 x18: 0000000000000000 
[  126.669357] x17: 0000000000000000 x16: 0000000000000000 
[  126.669360] x15: 0000000000000082 x14: ffffffa4699fffff 
[  126.669363] x13: ffffffa469a00000 x12: ffffffa469eeba70 
[  126.669367] x11: ffffffa45269321c x10: ffffffa452695000 
[  126.669370] x9 : ffffffa46749eef4 x8 : ffffffa468c007e8 
[  126.669373] x7 : ffffffa4699fffff x6 : 0068000000000713 
[  126.669376] x5 : 0000000000000000 x4 : ffffffbefe63c000 
[  126.669379] x3 : 0060000000000793 x2 : 0000000000000041 
[  126.669382] x1 : ffffffa469eeb000 x0 : ffffffa46ab34000 
[  126.669386] Call trace:
[  126.669390]  syscall_hook_init+0x108/0x160 [krhook]
[  126.669398]  do_one_initcall+0x16c/0x2dc
[  126.669404]  do_init_module+0x4c/0x1e0
[  126.669407]  load_module+0x1228/0x1358
[  126.669411]  __arm64_sys_finit_module+0xac/0xe4
[  126.669416]  el0_svc_common+0x98/0x160
[  126.669420]  el0_svc_handler+0x60/0x78
[  126.669423]  el0_svc+0x8/0x380
[  126.669428] Code: f940e109 d280f263 f2e00c03 f9000949 (f900e10b) 
[  126.669432] ---[ end trace e3f1c8293fdb20e1 ]---
[  126.669450] Kernel panic - not syncing: Fatal exception
[  126.669457] SMP: stopping secondary CPUs
[  126.669710] CPU3: stopping
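
The ESR value in the log already says what kind of fault this is. The sketch below decodes the architectural ESR_ELx fields for the value printed above (0x9600004e); the struct and function names are made up for this example, and the decoding only covers the data-abort fields that appear in the log.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct esr_fields {
	unsigned ec;   /* Exception Class, ESR bits [31:26] */
	unsigned il;   /* Instruction Length, bit [25]: 1 = 32-bit instruction */
	unsigned iss;  /* Instruction Specific Syndrome, bits [24:0] */
	unsigned wnr;  /* data aborts: bit [6], 1 = the access was a write */
	unsigned dfsc; /* data aborts: Data Fault Status Code, bits [5:0] */
};

struct esr_fields decode_esr(uint32_t esr)
{
	struct esr_fields f;

	f.ec   = (esr >> 26) & 0x3f;
	f.il   = (esr >> 25) & 0x1;
	f.iss  = esr & 0x1ffffff;
	f.wnr  = (esr >> 6) & 0x1;
	f.dfsc = esr & 0x3f;
	return f;
}

/* DFSC 0b0011LL encodes a permission fault at translation level LL. */
int dfsc_is_permission_fault(unsigned dfsc)
{
	return (dfsc & 0x3c) == 0x0c;
}

int dfsc_level(unsigned dfsc)
{
	return (int)(dfsc & 0x3);
}

/* Decode the value from the oops above and check it against the log. */
void esr_example_from_log(void)
{
	struct esr_fields f = decode_esr(0x9600004eu);

	assert(f.ec == 0x25);  /* data abort taken from the current EL */
	assert(f.wnr == 1);    /* matches "WnR = 1" in the log */
	printf("EC=0x%x IL=%u ISS=0x%x WnR=%u DFSC=0x%x (%s, level %d)\n",
	       f.ec, f.il, f.iss, f.wnr, f.dfsc,
	       dfsc_is_permission_fault(f.dfsc) ? "permission fault" : "other",
	       dfsc_level(f.dfsc));
}
```

For this value the decode gives a write (WnR = 1) that hit a level 2 permission fault, which is consistent with the "write to read-only memory" message and with the pmd-level entry shown in the page table dump line.
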
7 Comments
2024/05/17
04:38 UTC

0

Is kernel driver and kernel permission the same?

I'm new to tech and recently started to learn about the kernel because a friend of mine and I got into an argument about the Vanguard anti-cheat in League of Legends. I wanted to ask: are "kernel driver" and "kernel permission" similar concepts? Thank you in advance for the answers.

2 Comments
2024/05/17
03:58 UTC

0

How to debug a Linux distribution? (Read body)

I am trying to understand KVM and want to debug it using GDB.

I am currently compiling the kernel from source and running it in QEMU with GDB. But I don't have a full-fledged userspace to run QEMU on top of it, just a basic shell.

I was wondering whether I could run an Ubuntu image (instead of the compiled kernel) in QEMU and attach GDB to it.

Is it possible? Will the regular vmlinux symbol file work with it?

3 Comments
2024/05/15
12:54 UTC

4

How to fine tune a kernel for latency

Hello, I was wondering what the most common ways are to fine-tune a kernel to reduce its latency for a specific low-latency use case, such as high-frequency trading, where you need the fastest possible execution and I/O. By that I mean how to choose the kernel and what the main ideas behind the tuning are; some examples would be nice too.
If anyone here is experienced in this subject, I'd really appreciate some advanced resources as well!

6 Comments
2024/05/12
21:24 UTC

4

Why does HYP and Kernel have different virtual addresses in nVHE?

There are a lot of places in the kernel where kern_hyp_va() is used to translate symbols, which in turn calls __kern_hyp_va(). This is the comment in the source code:

/*
 * Convert a kernel VA into a HYP VA.
 *
 * Can be called from hyp or non-hyp context.
 *
 * The actual code generation takes place in kvm_update_va_mask(), and
 * the instructions below are only there to reserve the space and
 * perform the register allocation (kvm_update_va_mask() uses the
 * specific registers encoded in the instructions).
 */
static __always_inline unsigned long __kern_hyp_va(unsigned long v)
{ ... }

But with nVHE and protected KVM disabled, don't the kernel and HYP code run in the same address space? Why do we need to translate virtual addresses?

0 Comments
2024/05/12
18:16 UTC

11

How did you find what to work on for your first kernel patch

How long did you work on it and did you have anyone to ask for help like a mentor? I'm also curious to see the first patches if anybody can link theirs

11 Comments
2024/05/11
22:50 UTC

9

Driver development resources for updates to the kernel since Linux Device Drivers 3rd Edition was released?

I'm in the process of reading through Linux Device Drivers 3rd Edition as it seems like a good resource to build a foundation, but I know that there have been many changes since its release in 2005. What resources would you suggest for filling in the gaps one might have in modern Linux driver development, assuming a foundational knowledge provided by LDD3?

Thanks in advance for your time and help.

2 Comments
2024/05/11
20:17 UTC

6

Why are there two page table directories in arm64 kernel?

During boot, create_idmap creates an idmap of the kernel and uses init_idmap_pg_dir. But then in __primary_switch, when we enable the MMU, we load init_idmap_pg_dir into ttbr0_el1 and init_pg_dir into ttbr1_el1.

Why two page tables? And isn't the kernel always idmapped?
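
For context, AArch64 has two translation table base registers at EL1 and picks between them by the upper bits of the virtual address: addresses whose upper bits are all zero go through TTBR0_EL1, and addresses whose upper bits are all one go through TTBR1_EL1. The sketch below models only that selection. It is simplified: it assumes both ranges use the same number of VA bits and ignores features such as top-byte-ignore; the function and enum names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

enum ttbr { TTBR0, TTBR1, TTBR_FAULT };

/*
 * Decide which translation table base register would be used for va.
 * va_bits is the number of translated VA bits (for example 39 or 48),
 * assumed here to be the same for the low and the high range.
 */
enum ttbr ttbr_for_va(uint64_t va, unsigned va_bits)
{
	uint64_t upper, all_ones;

	assert(va_bits >= 16 && va_bits < 64);

	upper = va >> va_bits;            /* bits above the translated range */
	all_ones = UINT64_MAX >> va_bits; /* the same bits, all set */

	if (upper == 0)
		return TTBR0;   /* low range: where an identity map (VA == PA) lives */
	if (upper == all_ones)
		return TTBR1;   /* high range: where the kernel image is linked */
	return TTBR_FAULT;      /* neither range: translation fault */
}
```

With 48 VA bits, a low address such as 0x40080000 resolves through TTBR0 while a high address such as 0xffff800010000000 resolves through TTBR1, so an identity map of physical addresses and the kernel's linked virtual addresses necessarily live in different tables.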

0 Comments
2024/05/10
18:17 UTC

2

What is PoC and PoU?

During boot in head.S (arm64), we call dcache_clean_poc() which is defined in arch/arm64/mm/cache.S with another function called dcache_clean_pou(). The comment above it says:

Ensure that any D-cache lines for the interval [start, end) are cleaned to the PoC.

So what are PoC and PoU, and why do we have to clean to them?
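
PoC stands for Point of Coherency and PoU for Point of Unification. Maintenance "by address" works one cache line at a time, so a routine covering [start, end) has to round start down to a line boundary and step through the range. The sketch below shows only that loop in portable C. The names are made up for this example, the line size is passed in by the caller (a kernel would derive it from the hardware), and the callback stands in for the per-line operation, which on arm64 would be an instruction such as dc cvac (clean to PoC) or dc cvau (clean to PoU).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*line_op_t)(uintptr_t line_addr, void *ctx);

/*
 * Apply op to every cache line that overlaps [start, end).
 * line_size must be a power of two. Returns the number of lines visited.
 */
size_t for_each_cache_line(uintptr_t start, uintptr_t end,
			   size_t line_size, line_op_t op, void *ctx)
{
	uintptr_t addr;
	size_t visited = 0;

	assert(line_size != 0 && (line_size & (line_size - 1)) == 0);

	if (start >= end)
		return 0; /* empty interval: nothing to maintain */

	/* Round down so a range starting mid-line still covers that line. */
	for (addr = start & ~(uintptr_t)(line_size - 1);
	     addr < end; addr += line_size) {
		if (op)
			op(addr, ctx);
		visited++;
	}
	return visited;
}

/* Example callback: remember the address of the last line visited. */
void remember_line(uintptr_t line_addr, void *ctx)
{
	*(uintptr_t *)ctx = line_addr;
}
```

For example, with 64-byte lines the interval [0x1010, 0x1090) touches three lines, starting at 0x1000, 0x1040 and 0x1080.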

4 Comments
2024/05/09
06:54 UTC

2

How does kernel configure GIC CPU interface registers for each core?

I was going through the GIC manual, and it's mentioned that each core has its own CPU interface, which can be configured using the ICC_*_ELn registers, which are "memory mapped".

But how can all cores separately configure their CPU interface's registers when they are memory mapped? Don't all PEs have the same view of memory?

1 Comment
2024/05/07
07:34 UTC

1

how often to update 6.x kernel?

Until recently, I've been running kernel 5.x on my laptops (whatever the latest LTS kernel is). I've purchased a mini PC with the Intel N100 processor, and quickly learned I needed the 6.5 kernel.

Just wondering - how quickly are improvements made to the kernel? I used to only update my kernel once every few months - should I be doing that more often with the 6.5 kernel?

Thanks.

10 Comments
2024/05/05
12:27 UTC

7

Trying to understand the build process behind kernel modules

In a simple driver Makefile, you invoke:

make -C /lib/modules/`uname -r`/build modules M=`pwd`

/lib/modules/`uname -r`/build is a symbolic link to /usr/src/linux-headers-4.15.0-142-generic, so when we invoke make -C, make changes to /usr/src/linux-headers-4.15.0-142-generic and then runs with modules as the target and M set to the working directory. M names the external module's directory, which is also where the build output goes.

The relevant comment from /usr/src/linux-headers-4.15.0-142-generic/Makefile

# Use make M=dir to specify directory of external module to build 

You also have:

obj-m := my_driver.o
my_driver-objs := src1.o src2.o

Where obj-m names the kernel module and $(KERNEL_MODULE_NAME)-objs lists the object files it is linked from (one per source file). The only reference to obj-m I found is

# Build modules
#
# A module can be listed more than once in obj-m resulting in
# duplicate lines in modules.order files.  Those are removed
# using awk while concatenating to the final file.

Then we get to the module target, which is:

PHONY += modules
modules: $(vmlinux-dirs) $(if $(KBUILD_BUILTIN),vmlinux) modules.builtin                                                                              
    $(Q)$(AWK) '!x[$$0]++' $(vmlinux-dirs:%=$(objtree)/%/modules.order) > $(objtree)/modules.order
    @$(kecho) '  Building modules, stage 2.';
    $(Q)$(MAKE) -f $(srctree)/scripts/Makefile.modpost

modules.builtin: $(vmlinux-dirs:%=%/modules.builtin)
    $(Q)$(AWK) '!x[$$0]++' $^ > $(objtree)/modules.builtin

%/modules.builtin: include/config/auto.conf
    $(Q)$(MAKE) $(modbuiltin)=$*


# Target to prepare building external modules
PHONY += modules_prepare
modules_prepare: prepare scripts

And to be frank, this is where it starts going over my head. I'm not an expert with Make and prefer CMake when I can. But I guess my overarching question is: how important is it to fully understand this? I know the commands, but the actual build process and its specifics are fuzzy for me.
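
One small piece of the rules quoted above can be read in isolation: the awk program '!x[$$0]++' (the doubled $ is Makefile escaping for $0) prints a line only the first time it is seen, which is how the duplicate modules.order entries mentioned in the comment get dropped. The C sketch below does the same first-occurrence filtering on an array of strings; the function name and the sample paths in the usage note are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Keep only the first occurrence of each string, preserving order.
 * Works in place on the pointer array and returns the new element count.
 * Quadratic in n, which is fine for a short illustration.
 */
size_t dedup_keep_first(const char **lines, size_t n)
{
	size_t out = 0, i, j;

	assert(lines != NULL || n == 0);

	for (i = 0; i < n; i++) {
		int seen = 0;

		for (j = 0; j < out; j++) {
			if (strcmp(lines[j], lines[i]) == 0) {
				seen = 1;
				break;
			}
		}
		if (!seen)
			lines[out++] = lines[i]; /* first time: keep it */
	}
	return out;
}
```

Given the hypothetical entries a.ko, b.ko, a.ko, c.ko, b.ko, the function keeps a.ko, b.ko, c.ko in that order, the same result the awk filter would produce for those lines.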

0 Comments
2024/05/03
19:13 UTC

2

Why is linux kernel not booting under ARM TF-A?

0 Comments
2024/05/02
17:37 UTC
