From: Ahmed Soliman
Date: Mon, 13 Jul 2020 00:57:45 +0200
Subject: [gentoo-soc] Weekly Progress: Porting relibc
To: gentoo-soc@lists.gentoo.org
Cc: Luca Barbato

Hello,

This is the weekly progress report. Last week I was trying to implement dlopen, dlclose, and dlsym. They worked as I expected for most cases, but there seemed to be a complicated bug that only appeared in complex projects, in the form of memory corruption at random places. The problem here is that Rust should in theory guarantee this can never happen unless you write unsafe code; even more so, the place where I could observe this bug was safe code.
Specifically, this is the snippet that puzzled me:

  let mut data = Vec::new();
  let mut file = File::open(&path_c, flags).map_err(/*irrelevant*/)?;
  file.read_to_end(&mut data).map_err(/*irrelevant*/)?;
  println!("load_recursive: name: {} \"{}\"", (name).as_ptr() as usize, name);
  file.read_to_end(&mut data).map_err(/*irrelevant*/);
  println!("load_recursive: name: {} \"{}\"", (name).as_ptr() as usize, name);

and the output looked something like this:

  load_recursive: name: 93869686340272 "/home/oddcoder/sources/gcc/build/./gcc/liblto_plugin.so"
  load_recursive: name: 93869686340272 "�o �o�o"

The code above is part of a dlopen subroutine invocation.

At this stage I was stuck for a little while with absolutely no idea what to do. Then I tried to write my own program that is linked against that library at build time instead of through dlopen, so I could observe what really happens. So I did that with this toy program:

  // x86_64-lfs-linux-relibc-gcc code1.c -o code1 -llto_plugin
  #include <stdio.h>

  int main() {
      printf("hello world\n");
  }

It worked without crashing or anything, so at this point I knew that the problem is not with dynamic linking at all. So I created attempt 2:

  // x86_64-lfs-linux-relibc-gcc crasher.c -o crasher
  #include <dlfcn.h>
  #include <stdio.h>

  int main() {
      void *lib1 = dlopen("/home/oddcoder/sources/gcc/build/./gcc/liblto_plugin.so", 0);
      printf("Hello world\n");
  }

This program did segfault at stdlib::exit, in a way that makes me believe it is memory corruption. The thing is, the difference between dlopen and the code used by ld.so is basically nothing; in fact it is the exact same code. The only difference is that dlopen is usually called after libc.so is loaded.

My mentor (Luca Barbato) suggested using Rust's GlobalAlloc to trace memory allocations and see where the corruption is happening. Although I understood the general idea, I still didn't know how to use it. However, I found it used in relibc, not to trace memory allocations but to implement relibc's custom memory allocator, so I thought maybe the issue is there. Under the hood on Linux, relibc uses dlmalloc with mmap disabled, which means that all memory allocations are done via the brk() syscall. That made me believe that perhaps this is the root cause of the issue. I also learned that there is a global data structure used to do the bookkeeping and maintain the list of free memory chunks, and I assumed that if two instances of dlmalloc are running on the same heap, they would probably corrupt it.

To verify my assumption, I disabled normal malloc and used mspaces in dlmalloc, which are basically a way to make the bookkeeping data structure explicit so it can be shared. With this only half implemented (ld.so gets its own bookkeeping structure, isolated from relibc's), the crasher no longer crashes, for some reason. Still, a full implementation would share the bookkeeping structure between ld.so and relibc. (Note that the issue was that the memory was shared while each of them maintained a different list of free/used chunks over the same memory; ideally they would share both the memory and the list of used/freed chunks.) I tried it on the real gcc code base and unfortunately it crashed, but for completely different reasons (an early crash). I did in fact skip multiple details: I was inserting printlns everywhere and hacking through the code, trying to just get it working. Some things (mmap, for example) were disabled in dlmalloc but are required for mspaces to work.
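For reference, the tracing idea Luca suggested would look roughly like the sketch below. This is only my own minimal illustration, not relibc's actual allocator: TracingAlloc is a made-up name, and it wraps the system allocator rather than relibc's dlmalloc-backed one. The point is just that a GlobalAlloc wrapper forwards every allocation and logs it, so overlapping or double-managed ranges become visible in the output.

  use std::alloc::{GlobalAlloc, Layout, System};

  // Hypothetical tracing wrapper: forwards to the system allocator and
  // logs every alloc/dealloc so suspicious reuse of the same range can
  // be spotted in the trace.
  struct TracingAlloc;

  unsafe impl GlobalAlloc for TracingAlloc {
      unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
          let ptr = System.alloc(layout);
          // NOTE: printing can itself allocate; inside relibc this would
          // have to be a raw write() to avoid recursing into the allocator.
          eprintln!("alloc   {:p} size={}", ptr, layout.size());
          ptr
      }

      unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
          eprintln!("dealloc {:p} size={}", ptr, layout.size());
          System.dealloc(ptr, layout);
      }
  }

  #[global_allocator]
  static GLOBAL: TracingAlloc = TracingAlloc;

  fn main() {
      let v = vec![1u8, 2, 3]; // shows up as one alloc line in the trace
      drop(v);                 // ... and as a matching dealloc line
  }

In relibc itself the same kind of wrapper would sit in front of the dlmalloc-backed allocator instead of System.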
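As for the mspace experiment, the direction I am aiming for is roughly the sketch below. It assumes the C dlmalloc entry points create_mspace, mspace_malloc and mspace_free are available (i.e. dlmalloc built with MSPACES defined); relibc's vendored dlmalloc may expose this differently, and SHARED_MSPACE is only my placeholder for the handle that ld.so would create and hand over to relibc, so both allocate out of one bookkeeping structure instead of running two malloc states over the same brk heap.

  use core::ffi::c_void;

  // dlmalloc's mspace handle is an opaque pointer.
  type Mspace = *mut c_void;

  extern "C" {
      // Present when dlmalloc is compiled with MSPACES defined.
      fn create_mspace(capacity: usize, locked: i32) -> Mspace;
      fn mspace_malloc(msp: Mspace, bytes: usize) -> *mut c_void;
      fn mspace_free(msp: Mspace, mem: *mut c_void);
  }

  // Placeholder for the handle shared between ld.so and relibc; in the
  // real fix ld.so would create it first and export it to relibc.
  static mut SHARED_MSPACE: Mspace = core::ptr::null_mut();

  unsafe fn shared_malloc(bytes: usize) -> *mut c_void {
      if SHARED_MSPACE.is_null() {
          // capacity 0 asks dlmalloc for its default initial size;
          // locked = 0 means no internal locking.
          SHARED_MSPACE = create_mspace(0, 0);
      }
      mspace_malloc(SHARED_MSPACE, bytes)
  }

  unsafe fn shared_free(mem: *mut c_void) {
      mspace_free(SHARED_MSPACE, mem);
  }

The idea is simply that relibc's malloc/free and ld.so's internal allocations end up going through the same mspace, so there is one list of free chunks over the brk heap instead of two.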
Now that I know the basic idea works for the memory-corruption proof of concept, I will just git reset --hard and try to implement the fix as cleanly as possible. What slightly worries me is that I could be totally unlucky: the crash in gcc and the segfault in the PoC I created might end up being completely different bugs, and I would have to start over wondering what is causing the memory corruption.

Thanks,
Ahmed.