
[Pinned] Has your Java/C/C++ program crashed? Demystifying the segmentation fault (3)

Source: 程序員人生 | Published: 2015-05-29 07:58:45 | Views: 5622

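Preface

This follows the previous two articles:

Why won't your C/C++ program run? Demystifying Segmentation fault (1)
Why won't your C/C++ program run? Demystifying Segmentation fault (2)

Having written this far, the deeper I trace, the more I realize how little I know about the kernel. Remarkably little.
But since this is an investigation, let's settle down and get to the bottom of the segmentation fault.

This article is the final part of the series. It also goes deeper into segmentation faults, program debugging, and finding the cause of a crash (you usually will not be handed a perfect stack trace and a friendly error message).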

Tools and commands used in this article:

dmesg
strace
gdb
Linux kernel 3.10 source code

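Reproducing the scenario

The previous two articles revolved around a problem like this one:

// Wild pointer (uninitialized)
char ** p;
// Zero pointer, i.e. null pointer
p = NULL;
// Segmentation fault
*p = (char *)malloc(sizeof(char));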

The problem code

For readability, this article builds its problem code around the issue above:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>


int main(int argc,char** args) {
    char * p = NULL;
    *p = 0x0;
}

Segmentation fault

[Screenshot of the segmentation fault output not preserved]

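Finding the problem

Step 1: check the signal description with strace

The previous article covered the gdb + core dump method for finding the code that triggers the segmentation fault, so this one goes straight to strace: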

strace -i -x -o segfault.txt ./segfault.o

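This gives the following information:
[Screenshot of the strace output not preserved]

From it we learn:

1. Signal: SIGSEGV
2. Signal code (si_code): SEGV_MAPERR
3. Faulting memory address: 0x0
4. The fault occurred at logical address 0x400507 (the instruction pointer).

Our guess:

The program performs a null pointer access, trying to write to 0x0, and that triggers the segmentation fault.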

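Step 2: inspect the scene of the fault with dmesg

Run dmesg:

dmesg

The output:
[Screenshot of the dmesg output not preserved]

From it we learn:

1. Fault type: segfault, i.e. a segmentation fault.
2. Instruction pointer (ip) at the time of the fault: 0x400507
3. Error code: 6, i.e. binary 110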

Step 3: collect what we know

The error code and the ip are the key here. Compare the error code against the following:

    /*
     * Page fault error code bits:
     *
     *   bit 0 ==    0: no page found   1: protection fault
     *   bit 1 ==    0: read access     1: write access
     *   bit 2 ==    0: kernel-mode access  1: user-mode access
     *   bit 3 ==               1: use of reserved bit detected
     *   bit 4 ==               1: fault was an instruction fetch
     */
    /*enum x86_pf_error_code {

        PF_PROT     =       1 << 0,
        PF_WRITE    =       1 << 1,
        PF_USER     =       1 << 2,
        PF_RSVD     =       1 << 3,
        PF_INSTR    =       1 << 4,
    };*/

The comparison shows:

Error code 6 = binary 110 = (PF_USER | PF_WRITE | 0).
That is: "user mode", "write page fault", "no page corresponds to the given address".

This matches our initial guess.

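To summarize what we know so far:

1. Fault type: segfault, i.e. a segmentation fault.

2. Instruction pointer (ip) at the time of the fault: 0x400507

3. Error code: 6, i.e. binary 110

4. Signal code (si_code): SEGV_MAPERR, i.e. the address is not mapped to an object.

5. Cause: a write to 0x0 triggered the segmentation fault, because 0x0 has no corresponding page (no mapping).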

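Step 4: use what we know to find the faulty code

Run gdb:

gdb ./segfault.o

Using ip = 0x400507 from our findings, we immediately get:

[Screenshot of the gdb output not preserved]

Clearly this confirms our conclusion:

We tried to write the value 0x0 to the address 0x0, and writing to an unmapped address caused the segmentation fault.

We have also located the faulty code: line 9 of stack.c. [Screenshot not preserved]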

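Tracing it to the root

Of course we are not satisfied with that. Why does accessing 0x0 cause this fault and crash the program?

Part 2 covered the process virtual address space. When we perform a write, the virtual address has to be translated to a physical address, because the data (here the value 0x0, not to be confused with the address 0x0) ultimately has to be written to physical memory.

0x0 is a logical address. Linux manages memory mappings in pages, and 0x0 does not correspond to any page, so no such page exists in memory. Writing to it therefore raises a page fault, which is handled by the Linux memory management code (memory mapping, abbreviated mm).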

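Page fault handling

1. __do_page_fault

After the page fault, control enters __do_page_fault. To keep this short I have removed some of the comments from the source, and annotated the code paths that our case actually hits: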

/*
 * This routine handles page faults.  It determines the address,
 * and the problem, and then passes it off to one of the appropriate
 * routines.
 */
static void __kprobes
__do_page_fault(struct pt_regs *regs, unsigned long error_code /* note: our error code is 6, i.e. binary 110 */)
{
    struct vm_area_struct *vma;
    struct task_struct *tsk;
    unsigned long address;
    struct mm_struct *mm;
    int fault;
    int write = error_code & PF_WRITE;
    unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
                    (write ? FAULT_FLAG_WRITE : 0);

    tsk = current;
    mm = tsk->mm;

    /* This is where our address = 0x0 is fetched */
    /* Get the faulting address: */
    address = read_cr2();

    if (kmemcheck_active(regs))
        kmemcheck_hide(regs);
    prefetchw(&mm->mmap_sem);

    if (unlikely(kmmio_fault(regs, address)))
        return;

    if (unlikely(fault_in_kernel_space(address))) {
        // Omitted here, this branch is not taken in our case
        /* ... */
        return;
    }

    // A lot of code omitted here (including the start of the if block that closes below)
    // ...

retry:
        down_read(&mm->mmap_sem);
    } else {
        might_sleep();
    }

    vma = find_vma(mm, address);
    if (unlikely(!vma)) {

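        /* No VMA ends above the address: handled here. (As far as I can
         * tell, for 0x0 find_vma() may instead return the lowest VMA, in
         * which case one of the omitted checks below calls bad_area() too.) */
        bad_area(regs, error_code, address);
        // Return once it has been handled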
        return;
    }

    // A lot of code omitted here
    // ...
}

2. bad_area

One of the key calls in it is bad_area(regs, error_code, address);

static noinline void
bad_area(struct pt_regs *regs, unsigned long error_code, unsigned long address)
{
    /* Note that the signal code is set to SEGV_MAPERR here */
    __bad_area(regs, error_code, address, SEGV_MAPERR);
}

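This makes it clear

where the SEGV_MAPERR in our findings comes from.

This code means the address could not be mapped to an object. Look at what strace gave us, where
si_code=SEGV_MAPERR:

--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0} ---
+++ killed by SIGSEGV (core dumped) +++

Eventually we end up here: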

static void
__bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
               unsigned long address, int si_code)
{
    struct task_struct *tsk = current;

    /* Our error code is 6 = 110 and PF_USER = 100, so this branch is taken */
    if (error_code & PF_USER) {

        /* Enable interrupts */
        local_irq_enable();

        //... omitted

        if (address >= TASK_SIZE)
            error_code |= PF_PROT;

        /* This prints the fault information */
        if (likely(show_unhandled_signals))
            show_signal_msg(regs, error_code, address, tsk);

        tsk->thread.cr2     = address;
        tsk->thread.error_code  = error_code;
        tsk->thread.trap_nr = X86_TRAP_PF;

        /* This force-delivers the SIGSEGV (segmentation fault) signal */
        force_sig_info_fault(SIGSEGV, si_code, address, tsk, 0);

        return;
    }

    //... omitted
}

Note the two key calls in the code above:

show_signal_msg  // prints the fault information
force_sig_info_fault  // force-delivers the signal

3. show_signal_msg

/*
 * Print out info about fatal segfaults, if the show_unhandled_signals
 * sysctl is set:
 */
static inline void
show_signal_msg(struct pt_regs *regs, unsigned long error_code,
        unsigned long address, struct task_struct *tsk)
{
    //... omitted

    /* Print the segmentation fault information -> /proc/kmsg */
    printk("%s%s[%d]: segfault at %lx ip %p sp %p error %lx",
        task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,
        tsk->comm, task_pid_nr(tsk), address,
        (void *)regs->ip, (void *)regs->sp, error_code);

    print_vma_addr(KERN_CONT " in ", regs->ip);

    printk(KERN_CONT "\n");
}

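The code that prints the segmentation fault information produces exactly what we obtained with dmesg.

Compare it with the dmesg output for our segmentation fault:
[Screenshot of the dmesg output not preserved]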

4. force_sig_info_fault

The last step is delivering the signal.

static void
force_sig_info_fault(int si_signo, int si_code, unsigned long address,
             struct task_struct *tsk, int fault)
{
    unsigned lsb = 0;
    siginfo_t info;

    info.si_signo   = si_signo;
    info.si_errno   = 0;
    info.si_code    = si_code;
    info.si_addr    = (void __user *)address;
    if (fault & VM_FAULT_HWPOISON_LARGE)
        lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault)); 
    if (fault & VM_FAULT_HWPOISON)
        lsb = PAGE_SHIFT;
    info.si_addr_lsb = lsb;

    /* Force-deliver the SIGSEGV signal */
    force_sig_info(si_signo, &info, tsk);
}

force_sig_info:

int
force_sig_info(int sig, struct siginfo *info, struct task_struct *t)
{
    unsigned long int flags;
    int ret, blocked, ignored;
    struct k_sigaction *action;

    spin_lock_irqsave(&t->sighand->siglock, flags);

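    /* Look up the action (handler) registered for this signal */
    action = &t->sighand->action[sig-1];

    //... omitted

    /* The signal must be force-delivered */
    if (action->sa.sa_handler == SIG_DFL)
        /* SIGSEGV must not be delivered over and over, so clear SIGNAL_UNKILLABLE */
        t->signal->flags &= ~SIGNAL_UNKILLABLE;

    // Send it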
    ret = specific_send_sig_info(sig, info, t);
    spin_unlock_irqrestore(&t->sighand->siglock, flags);

    return ret;
}

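The code above shows how the handler for the signal is looked up. For the segmentation fault signal SIGSEGV, the default action is to terminate the process and dump core.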

5. core dump

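At this point we can get a core dump, so the method from Part 2 for finding the code that caused the segmentation fault applies. It is also the recommended approach: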

gdb ./segfault.o core.36054

[Screenshot of the gdb session not preserved]

Isn't it immediately obvious that *p = 0x0 on line 9 of stack.c is the culprit?

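Conclusion

That concludes the whole exploration of the segmentation fault. I hope it was as worthwhile a trip for you as it was for me.

Here are a few common causes of segmentation faults:

1. Array index out of bounds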

    int a[10] = {0,1};
    printf("%d",a[10000]);

2. Zero (null) pointer

    // The example used throughout this series
    char * p = NULL;
    *p = 0x0;

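3. Dangling pointer

If pointer p is dangling, the address it points to may or may not still be usable: you cannot know when that memory will be written to or when it will be protected (mprotect).
If it has been protected as read-only, a write to it causes a segmentation fault!

4. Access permissions, illegal access

See 3.

5. Multiple threads operating on a shared pointer variable

This is not limited to C/C++: Android and Java programs can crash the JVM this way too, so check the variables shared between your threads!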

If you find any mistakes, please do not hesitate to point them out.
