Categories
Programming

Web scraper with Scrapy

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import re

import scrapy

# Captures the song title from headings such as "Letra de <title> - ..."
title_regex = r'Letra de\s+([a-zA-Z0-9áéíóúñü_,!¡¿?"() ]+)\s-'


class ConchaPiquerSpider(scrapy.Spider):
    name = 'conchitabot'
    # Scrapy reads `allowed_domains` (plural) and expects bare domains, not URLs
    allowed_domains = ['www.coveralia.com']
    start_urls = ['http://www.coveralia.com/letras-de/concha-piquer.php']
    custom_settings = {
        'FEED_EXPORT_ENCODING': 'utf-8',
    }
    OUTPUT_DIR = './letras/'

    def parse(self, response):
        # Follow every lyric link found in the artist's song list
        lyric_links = response.css(".lista_uno li a::attr(href)").extract()
        for link in lyric_links:
            # urljoin resolves the href against the URL of the current page
            yield scrapy.Request(response.urljoin(link), callback=self.parse_lyric)

    def parse_lyric(self, response):
        title = None
        for raw_title in response.css("h1").extract():
            match = re.search(title_regex, raw_title)
            if match:
                title = match.group(1)
        if title is None:
            # No heading matched, so there is no file name to write to
            self.logger.warning("No title found in %s", response.url)
            return

        raw_text = response.css("#HOTWordsTxt::text").extract()
        lyric = self.clean_lyric("".join(raw_text))

        # Write one UTF-8 text file per song
        os.makedirs(self.OUTPUT_DIR, exist_ok=True)
        path = os.path.join(self.OUTPUT_DIR, title + ".txt")
        with open(path, "w", encoding="utf-8") as text_file:
            text_file.write(lyric)

    def clean_lyric(self, dirty_str):
        # Strip leading whitespace, then drop every run of newlines and tabs
        no_spaces = re.sub(r"^\s+", '', dirty_str)
        no_tabs = re.sub(r"[\n\t]+", '', no_spaces)
        return no_tabs

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Merge every lyric file under ./letras/ into a single HTML page.

import html
import os


def get_sorted_files(directory):
    """Return every file path under `directory`, sorted alphabetically."""
    filenamelist = []
    for root, dirs, files in os.walk(directory):
        for name in files:
            filenamelist.append(os.path.join(root, name))
    return sorted(filenamelist)


folder = "./letras/"
text = "<head><meta charset='utf-8'></head>"
for filename in get_sorted_files(folder):
    # Use the file name without its extension as the heading
    filebase = os.path.splitext(os.path.basename(filename))[0]
    with open(filename, 'r', encoding='utf-8') as f:
        # Escape the text so characters such as "<" or "&" do not break the page
        text += "<h1>" + html.escape(filebase) + "</h1><pre>" + html.escape(f.read()) + "</pre>"

with open("unified.html", "w", encoding='utf-8') as unified:
    unified.write(text)
Categories
Signal theory

Basis in a vector space

A basis is the skeleton from which a vector space is built. It lets us decompose any signal into a linear combination of simple building blocks, namely the basis vectors. The Fourier Transform is just a change of basis.

A basis is a set of vectors whose linear combinations can express any vector of the space.

\[ w^{(k)} \leftarrow \text{basis} \]

The canonical basis of \(\mathbb{R}^2\) is:

\(e^{(0)} = [1, 0]^T \ \ e^{(1) } = [0,1]^T \)

Nevertheless, there are other bases for \(\mathbb{R}^2\), for example:

\(e^{(0)} = [1, 0]^T \ \ e^{(1) } = [1,1]^T \)

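This latter basis is not orthogonal, as \(e^{(1)}\) has a component along \(e^{(0)}\): \(\left \langle e^{(0)}, e^{(1)} \right \rangle = 1 \neq 0\). Its vectors are still linearly independent, so it remains a valid basis.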
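A quick numerical check of the two example bases can make the distinction concrete. The sketch below is only illustrative: the helper names `inner` and `det2` are assumptions introduced here, and the determinant test for linear independence is specific to pairs of vectors in \(\mathbb{R}^2\).

```python
def inner(x, y):
    """Standard inner product of two real vectors."""
    return sum(a * b for a, b in zip(x, y))


def det2(u, v):
    """Determinant of the 2x2 matrix whose columns are u and v."""
    return u[0] * v[1] - u[1] * v[0]


canonical = ([1, 0], [0, 1])
skewed = ([1, 0], [1, 1])

for name, (e0, e1) in (("canonical", canonical), ("skewed", skewed)):
    independent = det2(e0, e1) != 0  # nonzero determinant: the pair spans R^2
    orthogonal = inner(e0, e1) == 0  # zero inner product: no shared component
    print(name, "independent:", independent, "orthogonal:", orthogonal)
```

Both pairs are linearly independent, but only the canonical pair is orthogonal.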

Formal definition

H is a vector space.

W is a set of vectors from H such that \(W = \left\{ w^{(k)}  \right\} \)

W is a basis of H if:

  1. Every \(x \in H\) can be written as \( x = \sum_{k=0}^{K-1} \alpha_k w^{(k)}, \ \ \alpha_k \in \mathbb{C} \)
  2. The coefficients \( \alpha_k \) are unique; that is, the basis vectors are linearly independent, so each vector has exactly one expansion in the basis.
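As a small illustration of both conditions in \(\mathbb{R}^2\), the coefficients of a vector can be obtained by solving a 2x2 linear system. The function below is a hypothetical helper, not part of any library; it uses Cramer's rule, which only works when the two basis vectors are linearly independent.

```python
from fractions import Fraction


def coefficients(x, w0, w1):
    """Solve x = a0*w0 + a1*w1 in R^2 with Cramer's rule."""
    det = w0[0] * w1[1] - w0[1] * w1[0]
    if det == 0:
        # Dependent vectors cannot give a unique expansion
        raise ValueError("w0 and w1 are linearly dependent: not a basis")
    a0 = Fraction(x[0] * w1[1] - x[1] * w1[0], det)
    a1 = Fraction(w0[0] * x[1] - w0[1] * x[0], det)
    return a0, a1


x = [3, 2]
print(coefficients(x, [1, 0], [0, 1]))  # coefficients in the canonical basis
print(coefficients(x, [1, 0], [1, 1]))  # coefficients in the basis {[1,0], [1,1]}
```

The same vector gets different coefficients in each basis, but within a given basis there is exactly one solution.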

A basis is orthogonal when the inner product of any two distinct basis vectors is 0:

\( \left \langle w^{(k)}, w^{(n)} \right \rangle = 0, \ \ \text{for } k \neq n \)

In addition, if the inner product of every basis vector with itself is 1, the basis is orthonormal.
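Orthonormality can be tested directly from these two conditions. The following is a minimal sketch; `is_orthonormal` and its tolerance `tol` are assumptions made for the example, and the inner product conjugates its first argument.

```python
import math


def inner(x, y):
    """Inner product, conjugating the first argument."""
    return sum(complex(a).conjugate() * b for a, b in zip(x, y))


def is_orthonormal(basis, tol=1e-12):
    """True if <w_k, w_n> is 1 for k == n and 0 otherwise, up to tol."""
    for k, wk in enumerate(basis):
        for n, wn in enumerate(basis):
            expected = 1.0 if k == n else 0.0
            if abs(inner(wk, wn) - expected) > tol:
                return False
    return True


s = 1 / math.sqrt(2)
print(is_orthonormal([[1, 0], [0, 1]]))   # canonical basis
print(is_orthonormal([[s, s], [s, -s]]))  # unit vectors at 45 degrees to the axes
print(is_orthonormal([[1, 1], [1, -1]]))  # orthogonal, but each norm is sqrt(2)
print(is_orthonormal([[1, 0], [1, 1]]))   # not orthogonal
```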

How to change the basis?

A vector can be represented in a new basis by projecting the current basis vectors onto the new ones. If \(x\) is represented in the basis \(\left\{ w^{(k)} \right\}\) with coefficients \(\alpha_k\), it can also be represented as a linear combination of the basis \(\left\{ v^{(k)} \right\}\) with coefficients \( \beta_k\). In mathematical notation:

\[ x = \sum_{k=0}^{K-1} \alpha_k w^{(k)} = \sum_{k=0}^{K-1} \beta_k v^{(k)} \]

If \(\left\{ v^{(k)} \right\}\) is orthonormal, each new coefficient \(\beta_h\) is a linear combination of the previous coefficients, weighted by the projections of the original basis vectors onto \(v^{(h)}\):

\[\beta_h = \left \langle v^{(h)}, x \right \rangle = \left \langle v^{(h)}, \sum_{k=0}^{K-1} \alpha_k w^{(k)} \right \rangle = \sum_{k=0}^{K-1} \alpha_k \left\langle v^{(h)}, w^{(k)} \right \rangle \]

This operation can also be written in matrix form, with \(c_{hk} = \left\langle v^{(h)}, w^{(k)} \right \rangle\):

\[ \begin{bmatrix}
\beta_0 \\
\vdots \\
\beta_{K-1}
\end{bmatrix} = \begin{bmatrix}
c_{00} & c_{01} & \cdots & c_{0\left(K-1 \right )}\\
\vdots & \vdots & \ddots & \vdots \\
c_{\left(K-1 \right )0} & c_{\left(K-1 \right )1} & \cdots & c_{\left(K-1 \right )\left(K-1 \right )}
\end{bmatrix}\begin{bmatrix}
\alpha_0 \\
\vdots \\
\alpha_{K-1}
\end{bmatrix} \]
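The matrix product can be sketched in a few lines of plain Python. This is an illustrative example under stated assumptions: `change_of_basis` is a hypothetical helper, the target basis `v` is assumed orthonormal, and each matrix entry is taken to be the inner product \(\left\langle v^{(h)}, w^{(k)} \right\rangle\) with the first argument conjugated.

```python
import math


def inner(x, y):
    """Inner product, conjugating the first argument."""
    return sum(complex(a).conjugate() * b for a, b in zip(x, y))


def change_of_basis(alpha, w, v):
    """Coefficients in the orthonormal basis v, given coefficients alpha in basis w."""
    K = len(alpha)
    # c[h][k] = <v^(h), w^(k)>
    c = [[inner(v[h], w[k]) for k in range(K)] for h in range(K)]
    return [sum(c[h][k] * alpha[k] for k in range(K)) for h in range(K)]


s = 1 / math.sqrt(2)
w = [[1, 0], [0, 1]]    # original basis (canonical)
v = [[s, s], [s, -s]]   # new orthonormal basis
alpha = [3, 2]          # x = 3*w0 + 2*w1 = [3, 2]

beta = change_of_basis(alpha, w, v)
# Rebuild x from the new coefficients to confirm both expansions agree
x_rebuilt = [sum(beta[h] * v[h][i] for h in range(2)) for i in range(2)]
print(beta, x_rebuilt)
```

Rebuilding \(x\) from \(\beta_k\) gives back \([3, 2]\) up to floating-point rounding, which confirms that both expansions describe the same vector.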

This operation is widely used in algebra. A well-known example of a change of basis is the Discrete Fourier Transform (DFT).
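To see the DFT as a change of basis, one can project a signal onto complex exponentials. The sketch below assumes the orthonormal Fourier basis \(v^{(h)}[n] = e^{j 2 \pi h n / N} / \sqrt{N}\); the function names are hypothetical, and the \(\sqrt{N}\) rescaling is only there to match the common unnormalized DFT convention.

```python
import cmath
import math


def inner(x, y):
    """Inner product of two finite-length signals, conjugating the first one."""
    return sum(complex(a).conjugate() * b for a, b in zip(x, y))


def fourier_basis(N):
    """Orthonormal Fourier basis: v^(h)[n] = exp(j*2*pi*h*n/N) / sqrt(N)."""
    return [[cmath.exp(2j * math.pi * h * n / N) / math.sqrt(N) for n in range(N)]
            for h in range(N)]


def dft_by_projection(x):
    """Coefficients beta_h = <v^(h), x> of x in the Fourier basis."""
    return [inner(v_h, x) for v_h in fourier_basis(len(x))]


x = [1, 2, 3, 4]
beta = dft_by_projection(x)
# Undo the 1/sqrt(N) normalization to compare with the unnormalized DFT
X = [math.sqrt(len(x)) * b for b in beta]
print(X)
```

For \(x = [1, 2, 3, 4]\) the rescaled coefficients agree, up to rounding, with the familiar DFT values \(10,\ -2+2j,\ -2,\ -2-2j\).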

Categories
Uncategorized

Inner product in vector space

The inner product is an operation that measures the similarity between vectors. In general, it takes two operands, both elements of a vector space, and returns a scalar in the set of complex numbers:

\[ \left \langle \cdot, \cdot \right \rangle : V \times V \rightarrow \mathbb{C}  \]

Formal properties

For \(x, y, z \in V\) and \(\alpha \in \mathbb{C}\), the inner product must fulfill the following rules:

To be distributive over vector addition:

\( \left \langle x+y, z \right \rangle = \left \langle x, z \right \rangle + \left \langle y, z \right \rangle \)

Commutative up to conjugation (the conjugate only matters when vectors are complex):

\( \left \langle x,y \right \rangle  = \left \langle y, x \right \rangle^* \)

Compatible with scalar multiplication (a scalar taken out of the first argument is conjugated):

\(  \left \langle \alpha x, y \right \rangle =  \alpha^* \left \langle x, y \right \rangle \)

\(  \left \langle  x, \alpha y \right \rangle =  \alpha \left \langle x, y \right \rangle \)

The self inner product must be a non-negative real number:

\(  \left \langle  x, x \right \rangle \geq 0 \)

The self inner product can be zero only when the element is the null element:

\( \left \langle x,x \right \rangle = 0 \Leftrightarrow x = 0 \)
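These rules can be spot-checked numerically. The sketch below assumes the inner product \(\sum_n x^*[n] y[n]\), which conjugates the first argument; the sample vectors and the helpers `add` and `scale` are arbitrary choices made for illustration. Integer-valued components keep the complex arithmetic exact, so plain equality is safe here.

```python
def inner(x, y):
    """Inner product, conjugating the first argument."""
    return sum(a.conjugate() * b for a, b in zip(x, y))


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def scale(c, u):
    return [c * a for a in u]


x = [1 + 2j, 3 - 1j]
y = [2 - 1j, 1j]
z = [-1 + 1j, 4 + 0j]
alpha = 2 - 3j

# Distributive over vector addition
assert inner(add(x, y), z) == inner(x, z) + inner(y, z)
# Commutative up to conjugation
assert inner(x, y) == inner(y, x).conjugate()
# Scalar multiplication in each argument
assert inner(scale(alpha, x), y) == alpha.conjugate() * inner(x, y)
assert inner(x, scale(alpha, y)) == alpha * inner(x, y)
# Self inner product is real and non-negative
assert inner(x, x).imag == 0 and inner(x, x).real >= 0
print("all properties hold for the sample vectors")
```

Passing these checks for a few vectors does not prove the properties, but a failure would immediately expose a wrong definition.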

Inner product in \(\mathbb{R}^2 \)

The inner product in \( \mathbb{R}^2\) is defined as follows:

\( \left \langle x, y \right \rangle = x_0 y_0 + x_1 y_1 \)

The self inner product is the squared norm of the vector:

\( \left \langle x, x \right \rangle = x^2_0 + x^2_1 = \left \| x \right \|^2 \)
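
As a worked example with arbitrarily chosen vectors, take \(x = [3, 4]^T\) and \(y = [4, -3]^T\):

\[ \left \langle x, x \right \rangle = 3^2 + 4^2 = 25 \ \Rightarrow \ \left \| x \right \| = 5, \qquad \left \langle x, y \right \rangle = 3 \cdot 4 + 4 \cdot (-3) = 0 \]

The two vectors have the same norm and are orthogonal to each other.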

Inner product in finite length signals

In this case, the inner product is defined as:

\[ \left \langle x ,y \right \rangle = \sum_{n= 0}^{N-1} x^*[n] y[n] \]
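A direct translation of this sum into Python could look as follows. It is a minimal sketch: the function names and the test signals are assumptions chosen for illustration.

```python
import math


def inner(x, y):
    """Inner product of two length-N signals: sum of conj(x[n]) * y[n]."""
    if len(x) != len(y):
        raise ValueError("signals must have the same length")
    return sum(complex(a).conjugate() * b for a, b in zip(x, y))


def norm(x):
    """Norm induced by the inner product."""
    return math.sqrt(inner(x, x).real)


constant = [1, 1, 1, 1]
alternating = [1, -1, 1, -1]

print(inner(constant, alternating))  # orthogonal signals give zero
print(inner([1, 1j], [1j, 1]))       # the conjugate matters for complex signals
print(norm([3, 4]))
```

The constant and alternating signals are orthogonal, which is the kind of relationship the Fourier basis exploits.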

Categories
Uncategorized

Properties of vector spaces

Vector spaces must meet the following rules, for any vectors \(x, y, z \in V\) and scalars \(\alpha, \beta\):
Addition to be commutative:
\( x + y = y + x \)

Addition to be associative:
\( (x+y)+z = x + (y + z) \)

Scalar multiplication to be distributive with respect to vector addition:
\( \alpha\left(x + y \right) = \alpha x + \alpha y\)

Scalar multiplication to be distributive with respect to the addition of field scalars:
\( \left( \alpha + \beta \right) x = \alpha x + \beta x \)

Scalar multiplication to be associative:
\( \alpha\left(\beta x \right) = \left(\alpha \beta \right) x \)

There must exist a null element:
\( \exists 0 \in V \ \ | \ \ x + 0 = 0 + x = x \)

There must exist an additive inverse for every element in the vector space:
\( \forall x \in V \exists (-x)\ \ | \ \ x + (-x) = 0\)
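
The rules above can be spot-checked on a familiar example, vectors of \(\mathbb{R}^2\) stored as Python lists. This is only an illustrative sketch: the helpers `add` and `scale` and the sample values are assumptions, and integer components are used so that equality is exact.

```python
def add(x, y):
    """Component-wise vector addition."""
    return [a + b for a, b in zip(x, y)]


def scale(alpha, x):
    """Multiplication of a vector by a scalar."""
    return [alpha * a for a in x]


x, y, z = [1, -2], [3, 5], [-4, 0]
alpha, beta = 2, -3
zero = [0, 0]
neg_x = scale(-1, x)

checks = {
    "commutative addition": add(x, y) == add(y, x),
    "associative addition": add(add(x, y), z) == add(x, add(y, z)),
    "distributive over vectors": scale(alpha, add(x, y)) == add(scale(alpha, x), scale(alpha, y)),
    "distributive over scalars": scale(alpha + beta, x) == add(scale(alpha, x), scale(beta, x)),
    "associative scaling": scale(alpha, scale(beta, x)) == scale(alpha * beta, x),
    "null element": add(x, zero) == x and add(zero, x) == x,
    "inverse element": add(x, neg_x) == zero,
}
for name, ok in checks.items():
    print(name, ok)
```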