Postmortem: tracing a program bug through the C++ STL source

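Over the past few days I ran into a bug in a program I was writing and only resolved it after digging into the lower layers, so here is a blog post as a postmortem.

The symptom: when the server handled requests, certain values were correct on the first request, but on every later request they came back as the same fixed, non-random value.

This is a fairly common scenario. Some would say the software is carrying state.

Since the first request was correct, the reasoning goes, the program logic itself is fine, and the problem must lie in some state flag, or in the lifetime of some value that can act as state.

That assumption sent the bug hunt straight down the wrong path.

Because it looked like leftover state, I checked and stepped through the relevant constructors and destructors and all the code involved in object lifetimes, and found nothing wrong.

The code is not public, so I skip the debugging of the upper-layer software here and go straight to gdb to show the final problem.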

[qianzichen@dev ~]$ ps -ef | grep -E '$regex...' | awk '{print $2}'
25497
[qianzichen@dev ~]$ gdb -p 25497 
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Attaching to process 25497
...
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
...
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
...
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
[New Thread 0x7f5c157fb700 (LWP 25531)]
[New Thread 0x7f5c161fc700 (LWP 25530)]
[New Thread 0x7f5c16bfd700 (LWP 25529)]
[New Thread 0x7f5c175fe700 (LWP 25528)]
[New Thread 0x7f5c17fff700 (LWP 25527)]
[New Thread 0x7f5c2cdfa700 (LWP 25526)]
[New Thread 0x7f5c2d7fb700 (LWP 25525)]
[New Thread 0x7f5c2e1fc700 (LWP 25524)]
[New Thread 0x7f5c2ebfd700 (LWP 25523)]
[New Thread 0x7f5c2f5fe700 (LWP 25522)]
[New Thread 0x7f5c2ffff700 (LWP 25521)]
[New Thread 0x7f5c48dfa700 (LWP 25520)]
[New Thread 0x7f5c497fb700 (LWP 25519)]
[New Thread 0x7f5c4a1fc700 (LWP 25518)]
[New Thread 0x7f5c4abfd700 (LWP 25517)]
[New Thread 0x7f5c4b5fe700 (LWP 25516)]
[New Thread 0x7f5c4bfff700 (LWP 25515)]
[New Thread 0x7f5c50f73700 (LWP 25514)]
[New Thread 0x7f5c51974700 (LWP 25513)]
[New Thread 0x7f5c5d3d2700 (LWP 25500)]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
...
(gdb) b exit
Breakpoint 1 at 0x3ec7a35d40
(gdb) b abort
Breakpoint 2 at 0x3ec7a33f90
(gdb) b src/path/to/target_file/file.cc:...
Breakpoint 3 at 0x7f5c5ed8042d: file src/path/to/target_file/file.cc, line ....
(gdb) c
Continuing.
[Switching to Thread 0x7f5c16bfd700 (LWP 25529)]

Breakpoint 3, (omitted...)
(gdb) p ctx
$1 = {px = 0x7f5c080008e0, pn = {pi_ = 0x7f5c08001430}}
(gdb) p ctx.px.a_member_instance
$2 = {
...
too large to display, omitted...
...
}
(gdb) set print pretty on
(gdb) p ctx.px.dbg_data_
$3 = {
  url_param_string = {
    static npos = 18446744073709551615, 
    _M_dataplus = {
      <std::allocator<char>> = {
        <__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, 
      members of std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Alloc_hider: 
      _M_p = 0x7f5c6beb5578 "zichen"
    }
  }, 
  request = 0x0, 
  search_context = 0x0, 
  xxx = {
...
    yyy = {
...
      }, <No data fields>}, 
...
  }, 
  doc_response_str = {
    static npos = 18446744073709551615, 
    _M_dataplus = {
      <std::allocator<char>> = {
        <__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, 
      members of std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Alloc_hider: 
      _M_p = 0x7f5c6beb5578 "zichen"
    }
  }, 
...
too large to display, omitted...
...
}
(gdb)

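As shown above, the empty strings in a vector, in a map, defined ad hoc anywhere else, or reached through other containers and access paths all have an _M_p pointer to the same address (0x7f5c6beb5578 in this session), and the value there is "zichen", which was the value passed to the server in the first request.

So the problem comes down to this: for these strings, c_str() returns a fixed address holding a fixed value.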
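
If attaching gdb is not an option, a similar check can be made from code. The following is a minimal sketch, not code from the service: `shares_buffer` and `empty_strings_share_buffer` are hypothetical helper names, and the result depends on the standard library in use (the copy-on-write string in this GCC version is expected to report sharing; other implementations may not).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true when both strings expose the same character buffer.
bool shares_buffer(const std::string& a, const std::string& b) {
  return a.c_str() == b.c_str();
}

// Hypothetical probe: do two independent default-constructed strings
// hand out the same c_str() address in this standard library?
bool empty_strings_share_buffer() {
  std::string a;
  std::string b;
  // An intact empty string must read back as "". In a process where the
  // shared buffer has been overwritten, this assertion would fire.
  assert(*a.c_str() == '\0');
  return shares_buffer(a, b);
}
```

Logging the result of `empty_strings_share_buffer()` once at startup is enough to learn whether empty strings are backed by one shared buffer in a given build.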

RTFS (Read The Friendly Source): open the libstdc++ headers of the GCC version in use:

[qianzichen@dev ~]$ vi /usr/local/gcc-4.8.5/include/c++/4.8.5/string
...
// You should have received a copy of the GNU General Public License and
// a copy of the GCC Runtime Library Exception along with this program;
// see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
// <http://www.gnu.org/licenses/>.

/** @file include/string
 *  This is a Standard C++ Library header.
 */

//
// ISO C++ 14882: 21  Strings library
//

#ifndef _GLIBCXX_STRING
#define _GLIBCXX_STRING 1

#pragma GCC system_header

#include <bits/c++config.h>
#include <bits/stringfwd.h>
#include <bits/char_traits.h>  // NB: In turn includes stl_algobase.h
#include <bits/allocator.h>
#include <bits/cpp_type_traits.h>
#include <bits/localefwd.h>    // For operators >>, <<, and getline.
#include <bits/ostream_insert.h>
#include <bits/stl_iterator_base_types.h>
#include <bits/stl_iterator_base_funcs.h>
#include <bits/stl_iterator.h>
#include <bits/stl_function.h> // For less
#include <ext/numeric_traits.h>
#include <bits/stl_algobase.h>
#include <bits/range_access.h>
#include <bits/basic_string.h>
#include <bits/basic_string.tcc>
...

Look at stringfwd.h:

[qianzichen@dev ~]$ vi /usr/local/gcc-4.8.5/include/c++/4.8.5/bits/stringfwd.h
...
namespace std _GLIBCXX_VISIBILITY(default)
{
_GLIBCXX_BEGIN_NAMESPACE_VERSION

  /**
   *  @defgroup strings Strings
   *
   *  @{ 
  */

  template<class _CharT>
    struct char_traits;

  template<typename _CharT, typename _Traits = char_traits<_CharT>,
           typename _Alloc = allocator<_CharT> >
    class basic_string;

  template<> struct char_traits<char>;

  /// A string of @c char
  typedef basic_string<char>    string;   

#ifdef _GLIBCXX_USE_WCHAR_T
  template<> struct char_traits<wchar_t>;

  /// A string of @c wchar_t
  typedef basic_string<wchar_t> wstring;
...

As shown above, string is a typedef for basic_string<char>, and basic_string is a class template.
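
The typedef can also be confirmed at compile time. This is a small sketch assuming a C++11 compiler mode (for GCC 4.8.5 that means passing -std=c++11); `kStringIsBasicStringChar` is a name made up for this example.

```cpp
#include <string>
#include <type_traits>

// Compile-time check (C++11): std::string and std::basic_string<char>
// are the same type, so whatever is found in basic_string applies to string.
constexpr bool kStringIsBasicStringChar =
    std::is_same<std::string, std::basic_string<char>>::value;

static_assert(kStringIsBasicStringChar,
              "std::string should be a typedef of std::basic_string<char>");
```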

Now look at the basic_string implementation:

[qianzichen@dev ~]$ vi /usr/local/gcc-4.8.5/include/c++/4.8.5/bits/basic_string.h

Find the c_str function:

/**
       *  @brief  Swap contents with another string.
       *  @param __s  String to swap with.
       *
       *  Exchanges the contents of this string with that of @a __s in constant
       *  time.
      */
      void
      swap(basic_string& __s);

      // String operations:
      /**
       *  @brief  Return const pointer to null-terminated contents.
       *
       *  This is a handle to internal data.  Do not modify or dire things may
       *  happen.
      */
      const _CharT*
      c_str() const _GLIBCXX_NOEXCEPT
      { return _M_data(); }

      /**
       *  @brief  Return const pointer to contents.
       *
       *  This is a handle to internal data.  Do not modify or dire things may
       *  happen.
      */
      const _CharT*
      data() const _GLIBCXX_NOEXCEPT
      { return _M_data(); }
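
Since both c_str() and data() simply return _M_data() in this header, they hand out the same pointer. A minimal sketch of that observation follows; `raw_buffer` is a hypothetical helper, and the equality it asserts is also what the standard has required since C++11.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns the string's internal buffer after checking
// that c_str() and data() agree on its address.
const char* raw_buffer(const std::string& s) {
  assert(s.c_str() == s.data());
  return s.c_str();
}
```

The pointer is a handle to internal data, as the header comment warns, so callers should only read through it.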

Reading on:

 private:
      // Data Members (private):
      mutable _Alloc_hider      _M_dataplus;

      _CharT*
      _M_data() const
      { return  _M_dataplus._M_p; }

      _CharT*
      _M_data(_CharT* __p)
      { return (_M_dataplus._M_p = __p); }

So what gets returned is the _M_p member of the _M_dataplus member. Next, find the _Alloc_hider struct:

...
      // Use empty-base optimization: http://www.cantrip.org/emptyopt.html
      struct _Alloc_hider : _Alloc
      {    
        _Alloc_hider(_CharT* __dat, const _Alloc& __a) 
        : _Alloc(__a), _M_p(__dat) { }

        _CharT* _M_p; // The actual data.
      };   

    public:
...

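The __dat parameter of the _Alloc_hider constructor initializes the _M_p member. The _CharT in its type is the character type passed to the basic_string class template on instantiation, which is char for string.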
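
The comment in the header mentions the empty-base optimization, which is why _Alloc_hider inherits from the allocator instead of storing it as a member. Here is a rough illustration with stand-in types (`EmptyAlloc`, `HiderWithBase`, and `HiderWithMember` are invented for this sketch and are not libstdc++ names); the sizes are what common ABIs produce, not a guarantee for every compiler.

```cpp
#include <cstddef>

// Illustrative stand-ins, not libstdc++ types.
struct EmptyAlloc {};  // stateless allocator: no data members

// Same shape as _Alloc_hider: the empty allocator is a base class,
// so it can occupy no storage of its own.
struct HiderWithBase : EmptyAlloc {
  char* p;
};

// Alternative layout: the allocator is a member and needs its own
// address, which costs at least one byte plus padding.
struct HiderWithMember {
  EmptyAlloc alloc;
  char* p;
};

const std::size_t kBaseSize = sizeof(HiderWithBase);
const std::size_t kMemberSize = sizeof(HiderWithMember);
```

With GCC on x86_64, kBaseSize typically equals sizeof(char*), while kMemberSize is larger, so the inheritance trick keeps a string object down to a single pointer.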

Now look at the basic_string constructor:

...
      // NB: We overload ctors in some cases instead of using default
      // arguments, per 17.4.4.4 para. 2 item 2.

      /**
       *  @brief  Default constructor creates an empty string.
       */
      basic_string()
#if _GLIBCXX_FULLY_DYNAMIC_STRING == 0
      : _M_dataplus(_S_empty_rep()._M_refdata(), _Alloc()) { }
#else
      : _M_dataplus(_S_construct(size_type(), _CharT(), _Alloc()), _Alloc()){ }
#endif
...

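There are two possible member-initializer lists here, selected by the preprocessor. Which one does the current environment use? Pinning down the value of _GLIBCXX_FULLY_DYNAMIC_STRING directly is not trivial, so take another route and edit the header itself, as below: put a symbol that no normal compiler defines, such as heihei ("hehe"), into one of the preprocessor branches.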

...
      // NB: We overload ctors in some cases instead of using default
      // arguments, per 17.4.4.4 para. 2 item 2.

      /**
       *  @brief  Default constructor creates an empty string.
       */
      basic_string()
#if _GLIBCXX_FULLY_DYNAMIC_STRING == 0
      : _M_dataplus(_S_empty_rep()._M_refdata(), _Alloc()) { }
#else
      : _M_dataplus(_S_construct(size_type(), _CharT(), _Alloc()), _Alloc()){ heihei }
#endif
...

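Then write a separate small test file. It should be simple enough to involve only string, yet compilation has to go far enough that the constructor body is actually parsed (preprocessing alone does not guarantee this code gets compiled).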

[qianzichen@dev ~]$ cat heihei.cc 
#include <string>
[qianzichen@dev ~]$

As above, the file is a single line. Now compile it:

[qianzichen@dev ~]$ /usr/local/gcc-4.8.5/bin/g++ heihei.cc 
/usr/lib/../lib64/crt1.o: In function `_start':
(.text+0x20): undefined reference to `main'
collect2: error: ld returned 1 exit status
[qianzichen@dev ~]$

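The only error comes from the linker (the file has no main), so the header compiled cleanly and the #else branch containing heihei was never seen by the compiler. The first branch is therefore the one in use.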

...
      // NB: We overload ctors in some cases instead of using default
      // arguments, per 17.4.4.4 para. 2 item 2.

      /**
       *  @brief  Default constructor creates an empty string.
       */
      basic_string()
#if _GLIBCXX_FULLY_DYNAMIC_STRING == 0
      : _M_dataplus(_S_empty_rep()._M_refdata(), _Alloc()) { heihei }
#else
      : _M_dataplus(_S_construct(size_type(), _CharT(), _Alloc()), _Alloc()){ }
#endif
...

To be sure, verify the other branch as well: edit the header as above and compile again.

[qianzichen@dev ~]$ /usr/local/gcc-4.8.5/bin/g++ heihei.cc 
In file included from /usr/local/gcc-4.8.5/include/c++/4.8.5/string:52:0,
                 from heihei.cc:1:
/usr/local/gcc-4.8.5/include/c++/4.8.5/bits/basic_string.h: In constructor ‘std::basic_string<_CharT, _Traits, _Alloc>::basic_string()’:
/usr/local/gcc-4.8.5/include/c++/4.8.5/bits/basic_string.h:439:62: error: ‘heihei’ was not declared in this scope
       : _M_dataplus(_S_empty_rep()._M_refdata(), _Alloc()) { heihei }
                                                              ^
/usr/local/gcc-4.8.5/include/c++/4.8.5/bits/basic_string.h:439:69: error: expected ‘;’ before ‘}’ token
       : _M_dataplus(_S_empty_rep()._M_refdata(), _Alloc()) { heihei }
                                                                     ^
[qianzichen@dev ~]$

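This time the error is reported inside the header itself. That confirms that, in this environment, the basic_string default constructor uses the first, simpler initializer.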
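
The same question can probably be answered without editing the installed header, by repeating the header's own #if condition in a scratch file. This is only a sketch: `kFullyDynamicString` and `default_ctor_branch` are hypothetical names, C++11 is assumed for constexpr, and the answer is meaningful only when compiling against libstdc++ (an undefined macro evaluates to 0 inside #if, just as it does in basic_string.h).

```cpp
#include <string>  // with libstdc++ this pulls in <bits/c++config.h>

// Mirror the condition used in basic_string.h.
#if _GLIBCXX_FULLY_DYNAMIC_STRING == 0
constexpr bool kFullyDynamicString = false;
#else
constexpr bool kFullyDynamicString = true;
#endif

// Hypothetical helper: names the branch the default constructor would take.
constexpr const char* default_ctor_branch() {
  return kFullyDynamicString ? "fully dynamic (_S_construct)"
                             : "shared empty rep (_S_empty_rep)";
}
```

Printing `default_ctor_branch()` from a small main, built with the same compiler and flags as the service, gives the answer without touching system files.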

_S_empty_rep()._M_refdata() is the __dat argument mentioned above.

Look at _S_empty_rep:

...
void
      _M_leak_hard();

      static _Rep&
      _S_empty_rep()
      { return _Rep::_S_empty_rep(); }

    public:
...

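It returns a reference to a _Rep instance in static storage; specifically, it forwards the return value of the static member function _Rep::_S_empty_rep().

Go straight to the _Rep struct: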

...
      struct _Rep : _Rep_base
      {
        // Types:
        typedef typename _Alloc::template rebind<char>::other _Raw_bytes_alloc;

        // (Public) Data members:

        // The maximum number of individual char_type elements of an
...
      static _Rep&
        _S_empty_rep()
        {
          // NB: Mild hack to avoid strict-aliasing warnings.  Note that
          // _S_empty_rep_storage is never modified and the punning should
          // be reasonably safe in this case.
          void* __p = reinterpret_cast<void*>(&_S_empty_rep_storage);
          return *reinterpret_cast<_Rep*>(__p);
        }

        bool
        _M_is_leaked() const
        { return this->_M_refcount < 0; }
...

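So the static function _S_empty_rep returns a reference to a _Rep instance that lives in static storage.

Here the libstdc++ developers silence the compiler's strict-aliasing warnings by casting through void*.

reinterpret_cast reinterprets the bit pattern of its operand at a low level. The type changes and the compiler emits no warning: when _S_empty_rep builds the returned reference from the address of _S_empty_rep_storage, the code is explicitly asserting that the conversion is valid. Whoever uses the returned reference then treats the object as a _Rep.

An old-style (C-style) cast such as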

char *pc = (char *)ip;

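has the same effect as reinterpret_cast. A C-style cast can equally strip const, which is how the minimal reproduction at the end of this post obtains a writable pointer from c_str().

The address returned is that of _S_empty_rep_storage. Search for that symbol: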

...
        // m = ((npos - sizeof(_Rep))/sizeof(CharT)) - 1
        // In addition, this implementation quarters this amount.
        static const size_type  _S_max_size;
        static const _CharT     _S_terminal;

        // The following storage is init'd to 0 by the linker, resulting
        // (carefully) in an empty string with one reference.
        static size_type _S_empty_rep_storage[];
...

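It is a static array, independent of any string instance, and according to the comment its storage is zero-initialized by the linker. Every default-constructed string in the process points into this one array.

This explains why c_str() returned a fixed address with a fixed value: all empty strings share a single buffer, so whatever that buffer holds is what every one of them shows.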
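
The mechanism can be modeled in a few lines. This is a toy model only, far simpler than libstdc++: `ToyEmptyRep`, `ToyString`, `g_empty_rep`, and `scribble` are names invented for the sketch, and the 8-byte data array is an arbitrary size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Toy stand-in for the representation header plus its character storage.
struct ToyEmptyRep {
  std::size_t length;
  std::size_t capacity;
  int refcount;
  char data[8];
};

// One zero-initialized instance in static storage, shared by every empty string.
static ToyEmptyRep g_empty_rep;

class ToyString {
 public:
  // Default construction allocates nothing; it points at the shared buffer.
  ToyString() : p_(g_empty_rep.data) {}
  const char* c_str() const { return p_; }

 private:
  char* p_;
};

// Simulates the bug: writing through a pointer that was only meant for reading.
void scribble(const ToyString& s, const char* text) {
  std::size_t n = std::strlen(text);
  assert(n < sizeof(g_empty_rep.data));  // stay inside the toy buffer
  std::memcpy(const_cast<char*>(s.c_str()), text, n + 1);
}
```

After `scribble(a, "zichen")` on one default-constructed ToyString, every other ToyString, including ones created later, reads back "zichen", which matches the symptom seen in gdb.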

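Somewhere in the program, something must have written to that address and corrupted the memory.

With the problem pinned down, back to hunting the server-side bug.

I defined a throwaway empty string as a probe and bisected my code to find the region where its contents went bad. That narrowed things down to the point after the summary request. I went into the summary module and kept searching, and finally found that one serialization step took the c_str() address of a string and wrote through it. The author presumably wanted to reuse that buffer directly.

After switching to a buffer owned by the program, the problem went away.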
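
The real serialization code is not public, so the following only sketches the shape of such a fix under stated assumptions: `write_payload` stands in for whatever routine needs a writable char buffer, `serialize` is a hypothetical caller, and the 64-byte capacity is an arbitrary bound chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for a serializer that needs a writable char buffer.
std::size_t write_payload(char* out, std::size_t cap, const char* payload) {
  assert(out != 0 && payload != 0);
  std::size_t n = std::strlen(payload);
  if (n > cap) n = cap;  // truncate rather than overflow
  std::memcpy(out, payload, n);
  return n;
}

// The output goes into a buffer this function owns, never into memory
// obtained from some string's c_str().
std::string serialize(const char* payload) {
  std::vector<char> buf(64);  // assumed upper bound, for the sketch only
  std::size_t n = write_payload(&buf[0], buf.size(), payload);
  return std::string(&buf[0], n);
}
```

The point of the design is ownership: the writable memory belongs to the caller, and the string is built from it afterwards, so no shared library storage is ever touched.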

Minimal reproduction:

[qianzichen@dev ~]$ vi heihei.cc
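#include <string>
#include <iostream>

#include <string.h>

int main() {
  // Default-constructed: in this libstdc++, c_str() points into the shared
  // static _S_empty_rep_storage rather than a private heap buffer.
  std::string test1;
  // The C-style cast strips const. Writing through the result is undefined
  // behavior; it is done here only to reproduce the bug.
  char *ptest1 = (char *)test1.c_str();

  // Copies 8 bytes, including the terminating NUL.
  strncpy(ptest1, "hug you", 8);
  std::cout << " ptest1 = " << ptest1 << std::endl;

  // Two more empty strings, created after the write.
  std::string test2;
  const char *ptest2 = test2.c_str();

  std::string test3;
  const char *ptest3 = test3.c_str();

  std::cout << " ptest2 = " << ptest2 << std::endl;
  std::cout << " ptest3 = " << ptest3 << std::endl;

  // Print the pointers as integers to compare the addresses.
  std::cout << " address of ptest1 = " << (unsigned long)ptest1 << std::endl;
  std::cout << " address of ptest2 = " << (unsigned long)ptest2 << std::endl;
  std::cout << " address of ptest3 = " << (unsigned long)ptest3 << std::endl;

  return 0;
}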

Run it:

[qianzichen@dev ~]$ ./a.out 
 ptest1 = hug you
 ptest2 = hug you
 ptest3 = hug you
 address of ptest1 = 261138363096
 address of ptest2 = 261138363096
 address of ptest3 = 261138363096
[qianzichen@dev ~]$

The output makes it plain: same address, same value.


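Looking back over the whole debugging process, the lesson is to first verify whether the premise "the software behaves correctly the first time" actually holds in full. Otherwise the investigation heads in the wrong direction and easily gets stuck.

During development I did consider a compromise that would sidestep the problem, but leaving the core issue unsolved is not acceptable, all the more so in high-performance, high-concurrency settings. As the old line goes: "do not return until Loulan is conquered."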


Linkerist
January 24, 2019, Jiuxianqiao
