follow_page() on x86

Hi, I was looking at the implementation of follow_page for 32bit x86 and I'm confused about how it handles the pud and pmd. Based on the code it does not seem to handle it correctly and I would have assumed that pud_offset and pmd_offset would have 0 as their 2nd argument so that these functions fold back onto the pgd entry. What am I missing?


static struct page *
__follow_page(struct mm_struct *mm, unsigned long address, int read, int write)
{
        pgd_t *pgd;
        pud_t *pud;
        pmd_t *pmd;
        pte_t *ptep, pte;
        unsigned long pfn;
        struct page *page;

        page = follow_huge_addr(mm, address, write);
        if (! IS_ERR(page))
                return page;

        pgd = pgd_offset(mm, address);
        if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
                goto out;

        pud = pud_offset(pgd, address);
        if (pud_none(*pud) || unlikely(pud_bad(*pud)))
                goto out;
        
        pmd = pmd_offset(pud, address);
        if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
                goto out;
        if (pmd_huge(*pmd))
                return follow_huge_pmd(mm, address, pmd, write);

        ptep = pte_offset_map(pmd, address);
        if (!ptep)
                goto out;

        pte = *ptep;
        pte_unmap(ptep);
        if (pte_present(pte)) {
                if (write && !pte_write(pte))
                        goto out;
                if (read && !pte_read(pte))
                        goto out;
                pfn = pte_pfn(pte);
                if (pfn_valid(pfn)) {
                        page = pfn_to_page(pfn);
                        if (write && !pte_dirty(pte) && !PageDirty(page))
                                set_page_dirty(page);
                        mark_page_accessed(page);
                        return page;
                }
        }

out:
        return NULL;
}
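
For reference, here is a minimal userspace sketch of how folded page-table levels can make this walk work on non-PAE 32-bit x86. It is modeled on the idea behind asm-generic/pgtable-nopud.h and pgtable-nopmd.h, where the folded `*_offset` helpers ignore the address and simply cast the pointer they are given. All `model_*` names and types are hypothetical and exist only for this illustration; this is not the kernel's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model types: each folded level just wraps the level above it. */
typedef struct { uint32_t val; } model_pgd_t;
typedef struct { model_pgd_t pgd; } model_pud_t;
typedef struct { model_pud_t pud; } model_pmd_t;

#define MODEL_PGDIR_SHIFT   22      /* 4 MiB per top-level entry (no PAE) */
#define MODEL_PTRS_PER_PGD  1024

/* Only the top level actually indexes into a table. */
static model_pgd_t *model_pgd_offset(model_pgd_t *pgd_table, uint32_t address)
{
    return pgd_table + (address >> MODEL_PGDIR_SHIFT);
}

/* Folded level: the address is ignored and the pgd entry is reinterpreted. */
static model_pud_t *model_pud_offset(model_pgd_t *pgd, uint32_t address)
{
    (void)address;
    return (model_pud_t *)pgd;
}

/* Folded level: same trick, one level further down. */
static model_pmd_t *model_pmd_offset(model_pud_t *pud, uint32_t address)
{
    (void)address;
    return (model_pmd_t *)pud;
}

/* Walk all three levels in the same order as __follow_page. */
static model_pmd_t *model_walk(model_pgd_t *pgd_table, uint32_t address)
{
    model_pgd_t *pgd = model_pgd_offset(pgd_table, address);
    model_pud_t *pud = model_pud_offset(pgd, address);
    model_pmd_t *pmd = model_pmd_offset(pud, address);

    /* With both middle levels folded, all three pointers alias one entry. */
    assert((void *)pmd == (void *)pgd);
    return pmd;
}
```

In this model, passing the real address to every level is harmless: the folded helpers never use it, so the walk still ends on the pgd entry without any caller having to pass 0.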

4 Comments

2025/02/03
02:23 UTC

Is futex_wait_multiple accessible from userspace?

I'm trying to figure out how/if I can call futex_wait_multiple from an application. I'm on kernel 6.9.3 (Ubuntu 24.04). As far as I can tell from the kernel sources, futex_wait_multiple is implemented in futex/waitwake.c, but there's no mention of it in the futex(2) manpage or in any of my kernel headers.

4 Comments

2025/01/22
18:47 UTC

Can I submit a driver upstream to the kernel if it wasn't written by me?

I recently found a driver on GitHub that seems to work. An equivalent driver is not currently in the kernel tree. The driver was not written by me, but has appropriate Copyright/compatible license headers in each file.

Can I modify the driver and upstream it to the kernel? I would happily maintain it, and I would probably drop it off in staging for a while, but are there any issues with me submitting code that I have not wholly written? I would of course audit all of it first.

8 Comments

2025/01/22
04:49 UTC

Will Linux allocate pids < 300 to user processes?

I was looking at the Linux 2.6.11 pid allocation function alloc_pidmap, which is called during process creation. Essentially, there's a variable last_pid which is initially 0, and every time alloc_pidmap is called, the function starts looking for free pids starting from last_pid + 1. If the pid it's trying to allocate is greater than the maximum pid, it wraps around to RESERVED_PIDS, which is 300. What I don't understand is that it doesn't seem to prevent pids < 300 from being given to user processes. Am I missing something, or will Linux indeed give pids < 300 to user processes? And why bother setting the pid offset to RESERVED_PIDS upon a wrap-around if it doesn't prevent those pids from being allocated the first time around? I've included the function in a pastebin for reference: https://pastebin.com/pnGtZ9Rm
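
The behaviour described above can be sketched with a small userspace model. This is a simplification under stated assumptions, not the kernel's alloc_pidmap: it uses a flat array instead of bitmap pages, a deliberately tiny `MODEL_PID_MAX`, and hypothetical `model_*` names throughout.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_PID_MAX        1024   /* deliberately tiny limit; an assumption */
#define MODEL_RESERVED_PIDS  300    /* wrap-around floor, as in the post */

struct model_pidmap {
    bool used[MODEL_PID_MAX];
    int last_pid;
};

static void model_pidmap_init(struct model_pidmap *map)
{
    memset(map, 0, sizeof(*map));   /* all pids free, last_pid = 0 */
}

/* Search upward from last_pid + 1; on overflow, restart at the reserved floor. */
static int model_alloc_pid(struct model_pidmap *map)
{
    int pid = map->last_pid + 1;
    int tries;

    if (pid >= MODEL_PID_MAX)
        pid = MODEL_RESERVED_PIDS;

    for (tries = 0; tries < MODEL_PID_MAX; tries++) {
        if (!map->used[pid]) {
            map->used[pid] = true;
            map->last_pid = pid;
            return pid;
        }
        if (++pid >= MODEL_PID_MAX)
            pid = MODEL_RESERVED_PIDS;
    }
    return -1;                      /* nothing free at or above the floor */
}

static void model_free_pid(struct model_pidmap *map, int pid)
{
    assert(pid > 0 && pid < MODEL_PID_MAX);
    map->used[pid] = false;
}
```

In this model the first pass hands out 1, 2, 3, ... including values below 300, because last_pid starts at 0. The reserved floor only takes effect after the first wrap-around: a low pid that was freed in the meantime is never handed out again, while a freed pid of 300 or above is.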

4 Comments

2025/01/20
16:42 UTC

kswapd0 bottlenecks heavy IO

Hi,

I am working on some data processing system, which pushes some GB/s to nvme disks using mmaped files.

I often observe that the CPU cores are less loaded than I expect (say I run 30 concurrent threads, but the app shows only around 600% CPU load), while the kswapd0 process sits at 100% CPU load.

My understanding is that kswapd0 is responsible for reclaiming memory pages, and it looks like it cannot reclaim pages fast enough because it is single-threaded, so it bottlenecks the system.

Any ideas how this can be improved? I am wondering if there is some multithreaded implementation of kswapd0 which could be enabled?

Thank you.

9 Comments

2025/01/20
03:01 UTC

NIC Driver - Performance - ndo_start_xmit shows dma_map_single alone takes up ~20% of CPU for UDP packets.

Summary

I'm trying to understand a performance difference in Linux's network stack between UDP and TCP, and also why the rtl8126 driver has performance issues with DMA access, but only on UDP.

I have most of my details in my Github link, but I'll add some details here too.

Main Question

Any idea why dma_map_single is very slow for skb->data for UDP packets, but much faster for TCP? It looks like it is about a 2x difference between TCP vs UDP.

* So I found out why TCP seems more performant than UDP: there is a caveat with iperf3. I observed in htop that there are nowhere near as many packets with TCP, even though I set -l 64 on iperf3. I tried setting --set-mss 88 (the lowest allowed by my system), but packets were still being sent at about 500 bytes. So the tests I have been doing were not 1-to-1 between UDP and TCP. However, I still don't understand exactly why the TCP packets are much bigger than what I ask iperf3 to send. Maybe the kernel does something to group them together into fewer skbs? Anyone know?

Second Question

Why does dma_map_single and dma_unmap_single take so much CPU time? In the Dynamic DMA mapping Guide - Optimizing Unmap State Space Consumption guide I noted this line:

On many platforms, dma_unmap_{single,page}() is simply a nop.

However, in my testing on this Intel 8500t machine, dma_unmap_single takes a lot of CPU, and I would like to understand when it is or isn't a nop.

dma_unmap_single takes a lot of CPU time, when on "many platforms" it shouldn't according to the Linux docs.

My Machine

Motherboard: HP ProDesk 400 G4 DM (latest BIOS)

CPU: Intel 8500t

RAM: Dual channel 2x4GB DDR4 3200

NIC: rtl8126

Kernel: 6.11.0-2-pve

Software: iperf3 3.18

Linux Params - Network stack:
find /proc/sys/net/ipv4/ -name "udp*" -exec sh -c 'echo -n "{}:"; cat {}' \;

find /proc/sys/net/core/ -name "wmem_*" -exec sh -c 'echo -n "{}:"; cat {}' \;

/proc/sys/net/ipv4/udp_child_hash_entries:0

/proc/sys/net/ipv4/udp_early_demux:1
/proc/sys/net/ipv4/udp_hash_entries:4096
/proc/sys/net/ipv4/udp_l3mdev_accept:0
/proc/sys/net/ipv4/udp_mem:170658 227544 341316
/proc/sys/net/ipv4/udp_rmem_min:4096
/proc/sys/net/ipv4/udp_wmem_min:4096
/proc/sys/net/core/wmem_default:212992
/proc/sys/net/core/wmem_max:212992

3 Comments

2025/01/19
23:08 UTC

A 2.6.11 32-bit kernel in QEMU keeps using high CPU even when it's idle.

I'm running a 2.6.11 32-bit kernel in qemu, with kvm enabled.
Even though it's idle, the cpu usage in the host is quite high.
( The CPU fan is audibly complaining about it. )

=== qemu command line ===
# bind it to core-0
taskset -c 0 qemu-system-x86_64 -m 4G -accel kvm \
-kernel bzImage -initrd initrd.cpio.gz \
-hda vm1.qcow2 \
-append 'console=ttyS0' \
-nographic
=========================

`top -d 1` showed two processes occupying most of the CPU time.
- qemu-system-x86_64
- kvm-pit/42982

The following are 30-second CPU samples of these two processes.

=== pidstat 30 -u -p $(pidof qemu-system-x86_64) ===
   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
  1000      3971    1.50    4.73    3.60    0.00    9.83     0  qemu-system-x86
====================================================

=== sudo pidstat 30 -u -p 42988 ===
   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
     0     42988    0.00    2.10    0.00    0.00    2.10     1  kvm-pit/42982
====================================

Almost 12% of CPU time is spent on this idle VM, with only a Bash shell waiting for input.
For comparison, I ran a cloud image of Alpine Linux with kernel 6.12.8-0-virt, and
`top -d 1` showed only 1-2% CPU usage.
So this is unusual and unacceptable; something's broken.

=== Run Alpine Linux ===
qemu-system-x86_64 -m 4G -accel kvm \
-drive if=virtio,file=alpine1.qcow2 -nographic
========================

=== `top -d 1` from guest vm ===
top - 02:02:10 up 6 min,  0 users,  load average: 0.00, 0.00, 0.00
Tasks:  19 total,   1 running,  18 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0% us,  0.0% sy,  0.0% ni, 96.2% id,  0.0% wa,  3.8% hi,  0.0% si
Mem:    904532k total,    12412k used,   892120k free,      440k buffers
Swap:        0k total,        0k used,        0k free,     3980k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  903 root      16   0  2132 1024  844 R  3.8  0.1   0:00.76 top
    1 root      25   0  1364  352  296 S  0.0  0.0   0:00.40 init
    2 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
    3 root      39  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
    4 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 events/0
    5 root      20  -5     0    0    0 S  0.0  0.0   0:00.00 khelper
   10 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 kthread
   18 root      20  -5     0    0    0 S  0.0  0.0   0:00.00 kacpid
   99 root      18  -5     0    0    0 S  0.0  0.0   0:00.00 kblockd/0
  188 root      20   0     0    0    0 S  0.0  0.0   0:00.00 pdflush
  112 root      25   0     0    0    0 S  0.0  0.0   0:00.00 khubd
  189 root      15   0     0    0    0 S  0.0  0.0   0:00.00 pdflush
  191 root      18  -5     0    0    0 S  0.0  0.0   0:00.00 aio/0
  190 root      25   0     0    0    0 S  0.0  0.0   0:00.00 kswapd0
  781 root      25   0     0    0    0 S  0.0  0.0   0:00.00 kseriod
  840 root      11  -5     0    0    0 S  0.0  0.0   0:00.00 ata/0
  844 root      17   0     0    0    0 S  0.0  0.0   0:00.00 khpsbpkt
=====================================

It's quite idle, except the `top` process.

kvm-pit (programmable interval timer): maybe it's related to the timer?

=== extracted from dmesg in guest ===
Using tsc for high-res timesource
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 pin1=2 pin2=-1
PCI: Using ACPI for IRQ routing
** PCI interrupts are no longer routed automatically.  If this
** causes a device to stop working, it is probably because the
** driver failed to call pci_enable_device().  As a temporary
** workaround, the "pci=routeirq" argument restores the old
** behavior.  If this argument makes the device work again,
** please email the output of "lspci" to bjorn.helgaas@hp.com
** so I can fix the driver.
Machine check exception polling timer started.
=======================================

Also I took a flamegraph of the QEMU process.

=== Get flamegraph by using https://github.com/brendangregg/FlameGraph ===
> perf record -F 99 -p $(pidof qemu-system-x86_64) -g -- sleep 30
> perf script > out.perf
> stackcollapse-perf.pl out.perf > out.folded
> flamegraph.pl out.folded > perf.svg
========================================================================
( screenshot of this svg shown below )

The svg file is uploaded here:
https://drive.google.com/file/d/1KEMO2AWp08XgBGGWQimWejrT-vLK4p1w/view

=== PS ===
The reason I run this quite old kernel is that
I'm reading the book "Understanding the Linux Kernel", which uses kernel 2.6.11.
It's easier to follow along when using the same version as the authors.
==========

https://preview.redd.it/zkm5u6dh02ee1.png?width=1341&format=png&auto=webp&s=f4e9d6ace820e8eefaaace406a378587cc851184

7 Comments

2025/01/19
23:03 UTC

Is reading ‘Computer Architecture a quantitative approach ~ John L hennessy, David A patterson’ book worthwhile in the linux kernel’s learning journey?

10 Comments

2025/01/18
15:53 UTC

Is it possible to connect two tap devices without a bridge, by using the host machine as a router?

I know it's trivial to achieve this with a bridge.
But I wonder if it's possible without one.

Say vm1.eth0 connects to tap1 and vm2.eth0 connects to tap2.

vm1.eth0's address is 192.168.2.1/24
vm2.eth0's address is 192.168.3.1/24

The two are on different subnets and use the host machine
as a router to communicate with each other.

=== Topology
      host
-----------------
   |         |
  tap1      tap2
   |         |
vm1.eth0  vm2.eth0
========================

=== Host
tap1 2a:15:17:1f:20:aa no ip address
tap2 be:a1:5e:56:29:60 no ip address

> ip route
192.168.2.1 dev tap1 scope link
192.168.3.1 dev tap2 scope link
====================================

=== VM1
eth0 52:54:00:12:34:56 192.168.2.1/24

> ip route
default via 192.168.2.1 dev eth0
=====================================

=== VM2
eth0 52:54:00:12:34:57 192.168.3.1/24

> ip route
default via 192.168.3.1 dev eth0
=====================================

=== Now in vm1, ping vm2
> ping 192.168.3.1
( stuck, no output )
======================================

=== In host, tcpdump tap1
> tcpdump -i tap1 -n
ARP, Request who-has 192.168.3.1 tell 192.168.2.1, length 46
============================================================

As tcpdump shows, vm1 gets no ARP reply,
since vm1 and vm2 aren't physically connected
(that is, tap1 and tap2 aren't connected to each other).
So I tried to use proxy ARP.

=== Try to use ARP proxy
# In host machine
> echo 1 | sudo tee /proc/sys/net/ipv4/conf/all/proxy_arp

# In vm1
> arping 192.168.3.1
Unicast reply from 192.168.3.1 [2a:15:17:1f:20:aa] 0.049ms
==========================================================

Well it did get a reply, but it's wrong!
`2a:15:17:1f:20:aa` is the macaddr of tap1!

So my understanding of ARP proxy is wrong.
I have Googled around the web, but got no answers.

Thanks.

6 Comments

2025/01/18
02:27 UTC

Why does preemptible RCU need two stages?

I recently read this post: https://lwn.net/Articles/253651/ and have some understanding of preemptible RCU.

But why does a full grace period consist of two stages?

Isn't it guaranteed that all CPUs are no longer using old values after one stage ends?

0 Comments

2025/01/17
02:24 UTC

[Bug?] Fedora's Bluetooth LE Privacy always defaults to disabled on fresh install, even when supported by hardware - would this be the cause?

Edit: Nvm i think i was misreading the structure hci_alloc_dev_priv, as privacy instead of private :')

I've noticed this issue across multiple Fedora installations:

Bluetooth LE Privacy (address randomization) is always disabled by default, even when the hardware supports it.

- Fresh Fedora install always has Bluetooth privacy disabled

- Even when hardware supports random addresses (verified with `btmgmt info`)

- Happens consistently across different machines/installs (all with intel cpu though)

~~Looking at hci_core.c in the kernel source, when a new Bluetooth device gets registered, it appears the HCI Link Layer privacy flag is being forced to 0 during initialization.~~

~~c hdev = kzalloc(alloc_size, GFP_KERNEL); if (!hdev) return NULL;~~

I am most likely missing a piece to the puzzle somewhere, I am extremely new to C and delving into the kernel. But would this be a bug or an intended feature?

edit:

Upon further investigation, it appears that the privacy mode setting is defaulting to Network Privacy (0x00) even when explicitly set to Device Privacy (0x01). This behavior occurs despite the correct definition in hci.h:

#define HCI_NETWORK_PRIVACY             0x00
#define HCI_DEVICE_PRIVACY              0x01

#define HCI_OP_LE_SET_PRIVACY_MODE      0x204e
struct hci_cp_le_set_privacy_mode {
        __u8      bdaddr_type;
        bdaddr_t  bdaddr;
        __u8      mode;
} __packed;

Also, forgive me for my terrible formatting on here, idk wtf is happening.

1 Comment

2025/01/15
01:37 UTC

how do i identify git commit id by kernel version.

Hello, I understand that this question has been asked dozens of times, but I still can't find a proper answer. So, I downloaded
https://www.kernel.org/pub/linux/kernel/v6.x/linux-6.6.69.tar.xz
and found commit from changelog that corresponds to:

commit a30cd70ab75aa6b7ee880b6ec2ecc492faf205b2
Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date:   Thu Jan 2 10:32:11 2025 +0100

    Linux 6.6.69
    
    Link: https://lore.kernel.org/r/20241230154211.711515682@linuxfoundation.org
    Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
    Tested-by: Shuah Khan <skhan@linuxfoundation.org>
    Tested-by: kernelci.org bot <bot@kernelci.org>
    Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
    Tested-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
    Tested-by: Hardik Garg <hargar@linux.microsoft.com>
    Tested-by: Ron Economos <re@w6rz.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

but I have no idea how to find it in the original source tree. How does this work? Should other remotes be added?

git co a30cd70ab75aa6b7ee880b6ec2ecc492faf205b2

fatal: unable to read tree (a30cd70ab75aa6b7ee880b6ec2ecc492faf205b2)

5 Comments

2025/01/14
19:17 UTC

Is developing kernels fun?

Hi all, I just saw a video on YouTube about Linux kernel development, and the person in the video said that developing kernels is boring because it is just bug fixing and nothing else. I don't know anything about Linux kernels (I just know they are the bridge between software and hardware). I am attracted to embedded work and kernels because I like the idea of controlling hardware with my code. Since Linux kernel development can be the main job for many embedded engineers, I really want to find out how enjoyable it is. Is it just fixing someone else's code or bugs? If anyone can share some insights on this topic, I will be really grateful. Thanks.

23 Comments

2025/01/13
18:46 UTC

Lazy TLB mode Linux 2.6.11

Hello,

I'm looking at the TLB subsystem code in Linux 2.6.11 and was trying to understand Lazy TLB mode. My understanding is that when a kernel thread is scheduled, the CPU is put in the TLBSTATE_LAZY mode. Upon a TLB invalidate IPI, the CPU executes the do_flush_tlb_all function, which first invalidates the TLB, then checks if the CPU is in TLBSTATE_LAZY and, if so, clears its CPU number in the memory descriptor's cpu_vm_mask so that it won't get future TLB invalidations.
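
To make that sequence concrete, here is a small userspace model of the IPI handler as described above. It is only a sketch: the `model_*` types and functions are hypothetical stand-ins, the local flush is represented by a counter, and the state values mirror TLBSTATE_OK and TLBSTATE_LAZY by assumption.

```c
#include <assert.h>

/* Stand-ins for TLBSTATE_OK / TLBSTATE_LAZY (values assumed for the model). */
enum model_tlb_state { MODEL_TLBSTATE_OK = 1, MODEL_TLBSTATE_LAZY = 2 };

/* One bit per CPU that still wants flush IPIs for this address space. */
struct model_mm {
    unsigned long cpu_vm_mask;
};

struct model_cpu {
    unsigned int id;
    enum model_tlb_state state;
    struct model_mm *active_mm;
    unsigned int local_flushes;     /* counts simulated local TLB flushes */
};

/* Drop this CPU from the mm's mask; only legal while in lazy mode. */
static void model_leave_mm(struct model_cpu *cpu)
{
    assert(cpu->state == MODEL_TLBSTATE_LAZY);
    cpu->active_mm->cpu_vm_mask &= ~(1UL << cpu->id);
}

/* Handler for the "flush everything" IPI, in the order the post describes. */
static void model_do_flush_tlb_all(struct model_cpu *cpu)
{
    cpu->local_flushes++;           /* stands in for __flush_tlb_all() */
    if (cpu->state == MODEL_TLBSTATE_LAZY)
        model_leave_mm(cpu);        /* opt out of future flush IPIs */
}
```

In this model the flush is unconditional: a lazy CPU still pays for the one flush that reached it, and what it avoids are later IPIs, because its bit is gone from cpu_vm_mask.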

My question is why doesn't the do_flush_tlb_all check whether the CPU is in TLBSTATE_OK before calling __flush_tlb_all to invalidate its local TLB. I thought the whole point of the lazy tlb state was to avoid flushing the TLB while a kernel thread executes because its virtual addresses are disjoint from user virtual addresses.

A somewhat tangential question: the tlb_state variable is declared as a per-CPU variable. However, all of the per-CPU variable code in this version of Linux seems to belong to x86-64 and not i386. Even in setup.c for i386 I don't see anywhere where the per-CPU variables are loaded, but I do see it in setup64.c. What am I missing?

Thank you

5 Comments

2025/01/10
21:00 UTC

What’s the good book that teaches advanced C concepts with respect to Linux?

4 Comments

2025/01/10
09:36 UTC

How do I create my own kernel

I wanna create my own kernel, but I don't know where to start. Please give me a roadmap of the concepts and skills I need to learn. I'm good at C and C++. I also have a high-level idea of operating systems, but I don't know too much.

Please also mention resources.

Thanks 👍

2 Comments

2025/01/10
09:31 UTC

I Wanna Learn How To Compile Kernel

I wanna compile all the code by myself and use it.. how do I do it ? I don't have any prior experience.. pls help

13 Comments

2025/01/09
00:19 UTC

DRM: GEM buffer is rendered only if unmapped before each rendering

So, I'm trying to understand the Linux graphics stack, and I came up with this small app that renders a test pattern on the screen. It uses libdrm and libgbm from Mesa for managing GEM buffers.

The problem I faced is that in order to render a GEM buffer (in the legacy manner, using drmModeSetCrtc), it has to be unmapped before each call to drmModeSetCrtc.

 for (int i = 0; i < 256; ++i) {
    fb = (xrgb8888_pixel *)gbm_bo_map(
        ctx->gbm_bo, 0, 0, gbm_bo_get_width(ctx->gbm_bo),
        gbm_bo_get_height(ctx->gbm_bo), GBM_BO_TRANSFER_READ_WRITE, &map_stride,
        &map_data);

   int bufsize = map_stride * ctx->mode_info.vdisplay;

   /* Draw something ... */

    gbm_bo_unmap(ctx->gbm_bo, &map_data);
    map_data = NULL;
    drmModeSetCrtc(ctx->card_fd, ctx->crtc_id, ctx->buffer_handle, 0, 0,
                   &ctx->conn_id, 1, &ctx->mode_info);
    
  }

For some reason the following code does nothing:

  fb = (xrgb8888_pixel *)gbm_bo_map(
        ctx->gbm_bo, 0, 0, gbm_bo_get_width(ctx->gbm_bo),
        gbm_bo_get_height(ctx->gbm_bo), GBM_BO_TRANSFER_READ_WRITE, &map_stride,
        &map_data);

  for (int i = 0; i < 256; ++i) {

   int bufsize = map_stride * ctx->mode_info.vdisplay;

    /* Draw something ... */

    drmModeSetCrtc(ctx->card_fd, ctx->crtc_id, ctx->buffer_handle, 0, 0,
                   &ctx->conn_id, 1, &ctx->mode_info);
  }

  gbm_bo_unmap(ctx->gbm_bo, &map_data);

Placing gbm_bo_unmap in the loop after drmModeSetCrtc also does nothing. Of course, multiple calls to gbm_bo_map and gbm_bo_unmap would cause undesirable overhead in a performance-sensitive app. The question is how to get rid of these calls. Is it possible to map the buffer only once, so that any change to it is visible to the graphics card without unmapping?

8 Comments

2025/01/06
09:55 UTC

which version of gcc can compile kernel 2.6.11?

I'm reading the book "Understanding the Linux Kernel, Third Edition". The kernel version used in the book is 2.6.11.

I tried to compile it with gcc 4.6.4 in a Docker container, but it failed with the following messages:

arch/x86_64/kernel/process.c: Assembler messages:
arch/x86_64/kernel/process.c:459: Error: unsupported for `mov'
arch/x86_64/kernel/process.c:463: Error: unsupported for `mov'
arch/x86_64/kernel/process.c:393: Error: unsupported for `mov'
arch/x86_64/kernel/process.c:394: Error: unsupported for `mov'
arch/x86_64/kernel/process.c:395: Error: unsupported for `mov'
arch/x86_64/kernel/process.c:396: Error: unsupported for `mov'
make[1]: *** [arch/x86_64/kernel/process.o] Error 1
make: *** [arch/x86_64/kernel] Error 2

The build instructions are:

make allnoconfig
make -j$(nproc)

The kernel source code is fetched from 2.6.11.1

The Docker image used is `gcc:4.6.4`.

4 Comments

2025/01/05
09:07 UTC

I want to learn Linux kernel development, but I have no idea where to start.

Hello,

As mentioned in the header, I have no idea where to start learning about the Linux kernel. I feel like I’m even worse than a beginner because I don’t have any knowledge of Linux programming, kernels, drivers, etc.

I do have a solid understanding of the C programming language in Ubuntu environment.

I have planned to enroll in an academy that specializes in teaching Linux, covering topics from system programming to device drivers and Yocto.

Here is the chronological roadmap of the courses offered by the academy:

Mastering Linux System Programming
Mastering Linux Kernel Programming
Embedded Linux Drivers & Yocto

My question is, where should I start learning to get a good grasp of the basics before moving on to Linux system programming? Your suggestions and tips would be very helpful in my learning journey.

14 Comments

2025/01/03
18:50 UTC

Novice programmer who wants to contribute to the kernel

Hey guys as the title suggests I am not a very experienced programmer and I am currently learning C. After that, I intend to read(and practise) the resources down below. However, since I am not very experienced I figured that I should make some projects before jumping into kernel dev... what would you guys recommend. I am thinking to make a small bootloader and then maybe a miniOS(these may not be tangible though hence, why I want your input). Is there a discord server for kernel dev and stuff like this? If this post was unclear I just basically just want to be pointed in the right direction after learning C.

P.S. I intend to contribute to the network stack/subsystem

Resources that I have been using(or will) so far:

https://www.udemy.com/course/c-programming-for-beginners (done)

https://www.udemy.com/course/advanced-c-programming-course (in the process)

C - Algorithmic Thinking_ A Problem-Based Introduction (need to read)

ldd3 (need to read, kinda outdated tho but ppl say it still has good info)

Computer Networking A Top-Down Approach (new, good stuff in it and I need to read it)

https://www.amazon.com/Linux-Kernel-Programming-practical-synchronization/dp/1803232226 (very new book is based on the 6.1 kernel)

Please tell me if I need to correct this/improve this etc. Happy new year!!!

EDIT: I USUALLY DUALBOOT LINUX AND WINDOWS HOWEVER I HAVE GOTTEN SICK OF IT AND INSTEAD, I HAVE BEEN USING WINDOWS + WSL. IS THIS FINE FOR KERNEL DEV?

The only reason I am stuck on Windows is because of some games not being supported.

22 Comments

2025/01/01
09:29 UTC

Build and install the kernel

Hi all, I want to start changing and understanding the kernel code. I want to (at least for the first few days) do everything on a VM, so that installing a kernel that I have made changes to does not break my daily driver (Ubuntu). So the question really is, can I really start on a VM? I would make some changes, install the kernel and see it in flight.

TIA!

2 Comments

2025/01/01
01:58 UTC

Research paper CS

I'm a CS graduate(2023). I'm looking to contribute in open research opportunities. If you are a masters/PhD/Professor/ enthusiast, would be happy to connect.

0 Comments

2024/12/30
11:28 UTC

The Concurrency Issues of mod_timer and refcount_inc

static int ip_frag_reinit(struct ipq *qp)
{
  unsigned int sum_truesize = 0;

  if (!mod_timer(&qp->q.timer, jiffies + qp->q.fqdir->timeout)) {
	refcount_inc(&qp->q.refcnt);
	return -ETIMEDOUT;
  }
}

This pattern appears in many places in the kernel, but since refcount_inc comes after mod_timer,

the timer may already have run on another CPU by the time mod_timer returns.

Is there a concurrency issue between mod_timer and refcount_inc?

0 Comments

2024/12/30
02:14 UTC

Why VBAR_EL2 register changed on cortex-a710?

I'm using QEMU to simulate ARM cortex-a710, I found that the VBAR_EL2 register was changed during boot. Here is the QEMU command:

/home/alan/Hyp/qemu-9.2.0/build/qemu-system-aarch64 \
 -drive file=./build/tmp/deploy/images/qemu-arm64/demo-image-jailhouse-demo-qemu-arm64.ext4.img,discard=unmap,if=none,id=disk,format=raw \
 -m 1G \
 -serial mon:stdio \
 -netdev user,id=net \
 -kernel  /home/alan/Code/linux-6.1.90/out/arch/arm64/boot/Image \
 -append "root=/dev/vda mem=768M nokaslr" \
 -initrd ./build/tmp/deploy/images/qemu-arm64/demo-image-jailhouse-demo-qemu-arm64-initrd.img \
 -cpu cortex-a710 \
 -smp 16 \
 -machine virt,gic-version=3,virtualization=on,its=off \
 -device virtio-serial-device \
 -device virtconsole,chardev=con \
 -chardev vc,id=con \
 -device virtio-blk-device,drive=disk \
 -device virtio-net-device,netdev=net \
  -gdb tcp::1234 -S

I'm pretty sure that, since I enabled virtualization, the Linux kernel starts at EL2, so __hyp_stub_vectors is installed as the initial VBAR_EL2, judging by the code in arch/arm64/kernel/head.S:

SYM_INNER_LABEL(init_el2, SYM_L_LOCAL)
        mov_q   x0, HCR_HOST_NVHE_FLAGS
        msr     hcr_el2, x0
        isb

        init_el2_state

        /* Hypervisor stub */
        adr_l   x0, __hyp_stub_vectors
        msr     vbar_el2, x0  >>>>> original value
        isb

        mov_q   x1, INIT_SCTLR_EL1_MMU_OFF

        /*
         * Fruity CPUs seem to have HCR_EL2.E2H set to RES1,
         * making it impossible to start in nVHE mode. Is that
         * compliant with the architecture? Absolutely not!
         */
        mrs     x0, hcr_el2
        and     x0, x0, #HCR_E2H
        cbz     x0, 1f

        /* Set a sane SCTLR_EL1, the VHE way */
        msr_s   SYS_SCTLR_EL12, x1
        mov     x2, #BOOT_CPU_FLAG_E2H
        b       2f

1:
        msr     sctlr_el1, x1
        mov     x2, xzr
2:
        msr     elr_el2, lr
        mov     w0, #BOOT_CPU_MODE_EL2
        orr     x0, x0, x2
        eret
SYM_FUNC_END(init_kernel_el)

I've debugged the code line by line using gdb, and I'm sure that the original value of VBAR_EL2 is :

(gdb) i r VBAR_EL2  
VBAR_EL2       0x411c0000          1092354048

BUT once the system booted, VBAR_EL2 changed to:

(gdb) i r VBAR_EL2
VBAR_EL2       0xffff800008012800  -140737354061824

By looking at the System.map file 0xffff800008012800 is __bp_harden_el1_vectors

ffff800008011d24 t el0t_32_fiq
ffff800008011eb8 t el0t_32_error
ffff80000801204c t ret_to_kernel
ffff8000080120b0 t ret_to_user
ffff800008012800 T __bp_harden_el1_vectors >>> changed to this address
ffff800008014344 T __entry_text_end
ffff800008014350 t arch_local_save_flags
ffff800008014360 t arch_irqs_disabled_flags

I should add that when simulating an ARM cortex-a53, no such issue occurs: VBAR_EL2 stays at 0x411c0000. So is this some bug between ARMv9 and Linux kernel 6.1.90?

2 Comments

2024/12/26
11:00 UTC

How to set a breakpoint at arm64 kernel startup entry point using QEMU and GDB

I want to set a breakpoint at the kernel startup entry point. It's an ARM64 QEMU setup, here is the command line of QEMU:

/home/alan/Hyp/qemu-9.2.0/build/qemu-system-aarch64 \
-drive file=./build/tmp/deploy/images/qemu-arm64/demo-image-jailhouse-demo-qemu-arm64.ext4.img,discard=unmap,if=none,id=disk,format=raw \
-m 1G \
-serial mon:stdio \
-netdev user,id=net \
-kernel /home/alan/Code/linux-6.1.90/out/arch/arm64/boot/Image \
-append "root=/dev/vda mem=768M nokaslr" \
-initrd ./build/tmp/deploy/images/qemu-arm64/demo-image-jailhouse-demo-qemu-arm64-initrd.img \
-cpu cortex-a53 \
-smp 16 \
-machine virt,gic-version=3,virtualization=on,its=off \
-device virtio-serial-device \
-device virtconsole,chardev=con \
-chardev vc,id=con \
-device virtio-blk-device,drive=disk \
-device virtio-net-device,netdev=net \
-gdb tcp::1234 -S

I want to break at the entry point in the file arch/arm64/kernel/head.S. I understand that a physical address should be given to gdb, as the MMU is not yet enabled at startup. But what physical address should I use? Is it the address of the kernel code that can be found in /proc/iomem?

root@demo:~# cat /proc/iomem 
00000000-03ffffff : 0.flash flash@0
04000000-07ffffff : 0.flash flash@0
08000000-0800ffff : GICD
080a0000-08ffffff : GICR
09000000-09000fff : pl011@9000000
  09000000-09000fff : 9000000.pl011 pl011@9000000
09010000-09010fff : pl031@9010000
  09010000-09010fff : rtc-pl031
09030000-09030fff : pl061@9030000
  09030000-09030fff : 9030000.pl061 pl061@9030000
0a003a00-0a003bff : a003a00.virtio_mmio virtio_mmio@a003a00
0a003c00-0a003dff : a003c00.virtio_mmio virtio_mmio@a003c00
0a003e00-0a003fff : a003e00.virtio_mmio virtio_mmio@a003e00
10000000-3efeffff : pcie@10000000
40000000-6fffffff : System RAM
  40210000-41b0ffff : Kernel code  > tried b *0x40210000, but no luck. 
  41b10000-4226ffff : reserved
  42270000-426bffff : Kernel data
  48000000-483f0fff : reserved
  48400000-484fffff : reserved
  6cf30000-6fdfffff : reserved
  6fe59000-6fe5afff : reserved
  6fe5b000-6fe5bfff : reserved
  6fe5c000-6fe6ffff : reserved
  6fe70000-6fe7dfff : reserved
  6fe7e000-6fffffff : reserved
4010000000-401fffffff : PCI ECAM
8000000000-ffffffffff : pcie@10000000

I can stop at start_kernel function, so my gdb and qemu settings are good I think.

Update with solution

I've found a solution to the question. Since `MMU` was not enabled at the early stage, we have to break at the physical address. But what's the right starting address(`PA`)? I found out that the physical address of the entry point is `0x40200000`. Instead of loading `vmlinux` with `gdb`, I'm using `add-symbol-file` to `vmlinux` and specifying the section name and its corresponding physical address.

add-symbol-file vmlinux -s .head.text 0x40200000 -s .text 0x40210000

Then run b _text; according to the file `vmlinux.lds.S`, _text is the entry point of the kernel.

After this gdb can stop at the first line of the kernel:

(gdb) add-symbol-file vmlinux -s .head.text 0x40200000 -s .text 0x40210000
add symbol table from file "vmlinux" at
        .head.text_addr = 0x40200000
        .text_addr = 0x40210000
(y or n) y
Reading symbols from vmlinux...
(gdb) b _text
Breakpoint 1 at 0x40200000: file ../arch/arm64/kernel/head.S, line 60.
(gdb) c
Continuing.
    
Thread 1 hit Breakpoint 1, _text () at ../arch/arm64/kernel/head.S:60
60              efi_signature_nop                       // special NOP to identity as PE/COFF executable
(gdb) n
61              b       primary_entry                   // branch to kernel start, magic
(gdb) 
89              bl      preserve_boot_args
(gdb) n

12 Comments

2024/12/25
09:17 UTC

What's the lowest level at which a noob can configure screen orientation?

I've done desktop, terminal and then finally grub, but what's driving me nuts is that my laptop's bootloader still initially loads in portrait rather than landscape.

I've tried searching but anything containing gnu, grub, bootloader, etc... only turns up results for rotating the intermediary grub loading screen or the terminal.

Is there a way to rotate it on a kernel level so that anything on top of it is also rotated?

I'm at a point where I did a fresh, terminal-only install of Debian 12 so that I can install a DE after applying a solution and it'll be oriented correctly.

May be worth mentioning that the device is a mini-laptop with touchscreen (and the touch function is also skewed 90°), no idea what weird components they might have used to build this thing.

3 Comments

2024/12/19
09:23 UTC

Say I had exported `filldir` from fs/readdir.c, how can I hook it to hide paths using kprobes? Been losing sleep over this, any insight?

Hi, currently developing a kernel module that works like GoboHide, I've exported:

0000000000000000 T vfs_rmdir
0000000000000000 T vfs_unlink
0000000000000000 T vfs_symlink
0000000000000000 T compat_filldir
0000000000000000 T filldir64
0000000000000000 T filldir

I want to hook filldir & filldir64 to be able to hide paths. I've successfully hooked the functions, but I'm doing something wrong, because when I try to hide a path, everything that calls filldir or filldir64 crashes, so my PC is left unusable until I do a sysrq+REISUB.

Any help on this would be greatly appreciated, thanks!

Here's an example of having loaded the hidefs module, having correctly hooked filldir64, and then having set /home/anto/Downloads as hidden, then trying to run ls.

https://ibb.co/sWBVg2H

current hidefs.c (not pushed to the GitHub repo yet, due to the aforementioned issues)

https://paste.ajam.dev/p/BE0Yap

13 Comments

2024/12/17
03:29 UTC