wiki_music.library.parser module

Warning

Documentation is still under construction; some things might not be up to date.

library.parser.WikipediaRunner

Toplevel Wikipedia Parser class.

Inherits all other parser subclasses. This is the class that is intended for user interaction. Its methods know how to run the parser in order to produce meaningful results.

Warning

This is the only parser class that is meant for user interaction. Calling its subclasses directly might result in unexpected behaviour.

Parameters:
  • album (str) – album name
  • albumartist (str) – band name
  • work_dir (str) – directory with music files
  • with_log (bool) – whether the parser should output its progress to the logger, only for CLI mode
  • GUI (bool) – if True, assume the app is running in GUI mode; if False, assume the app is running in CLI mode
  • protected_vars (bool) – whether to initialize protected variables or not
  • multi_threaded (bool) – whether to run some parts of the code in threads
wiki_music.library.parser.WikipediaRunner.ALBUM

String with album name.

This attribute can be protected from resetting in the __init__ method.

Type:str
wiki_music.library.parser.WikipediaRunner.ALBUMARTIST

String with band name.

This attribute can be protected from resetting in the __init__ method.

Type:str
wiki_music.library.parser.WikipediaRunner.ARTIST

Each entry in list holds list of artists for one track.

Type:List[List[str]]
wiki_music.library.parser.WikipediaRunner.COMPOSER

Each entry in list holds list of composers for one track.

Type:List[List[str]]
wiki_music.library.parser.WikipediaRunner.COVERART

Holds cover art read into memory as a bytes object.

Type:bytes
wiki_music.library.parser.WikipediaRunner.DATE

Album release date.

Type:str
wiki_music.library.parser.WikipediaRunner.DISCNUMBER

List containing the disc number for every track.

Type:List[int]
wiki_music.library.parser.WikipediaRunner.FILE

List of files on local disk corresponding to each track.

Type:List[str]
wiki_music.library.parser.WikipediaRunner.GENRE

Holds the album genre selected automatically or by user.

If genres is a list with one item, then GENRE is that item; otherwise user input is required to select from the genres list.

Type:str
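
The selection rule described above can be sketched roughly as follows. This is only an illustration: the function name select_genre and the ask_user callback are hypothetical and are not part of wiki_music.

```python
from typing import Callable, List


def select_genre(genres: List[str],
                 ask_user: Callable[[List[str]], str]) -> str:
    """Return the only genre if there is exactly one, otherwise ask the user."""
    if len(genres) == 1:
        return genres[0]
    # Zero or several candidates: defer to the (hypothetical) user prompt.
    return ask_user(genres)
```

In GUI mode the callback would presumably open a dialog, in CLI mode a console prompt; either way the single-genre case never needs user input.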
wiki_music.library.parser.WikipediaRunner.LYRICS

List of lyrics corresponding to each track.

Type:List[str]
wiki_music.library.parser.WikipediaRunner.NLTK_names

Use nltk to extract person names from sections of wikipedia page.

See also

wiki_music.constants.parser_const.PERSONNEL_SECTIONS
these sections of the page + track_listing are passed to nltk to look for names
Raises:wiki_music.utilities.exceptions.NoNames2ExtractException – when the page doesn’t have any of the sections defined in the See also section, so no text can be provided to nltk. It makes no sense to try extraction from other parts of the page as they are too cluttered by hardly classifiable information.
wiki_music.library.parser.WikipediaRunner.TITLE

List of track names.

Type:List[str]
wiki_music.library.parser.WikipediaRunner.TRACKNUMBER

List of numbers for each track.

Type:List[str]
wiki_music.library.parser.WikipediaRunner.TYPE

List of track types.

Type:List[str]
wiki_music.library.parser.WikipediaRunner._preload_id

Get unique id for each preload instance.

The id is a three-tuple of album, band and offline_debug flag.

Type:Tuple[str, str, str]
wiki_music.library.parser.WikipediaRunner.bracketed_types

Takes _bracketed_types list, populates and returns it.

See also

wiki_music.utilities.parser_utils.bracket()
function used to append brackets at the ends of strings in list
Type:List[str]
wiki_music.library.parser.WikipediaRunner.debug_folder

Path to debugging folder.

Type:str
wiki_music.library.parser.WikipediaRunner.files

Gets list of music files in currently set working directory.

Type:List[Path]

See also

_reassign_files()

library.parser.base

Base module for all parser classes.

class wiki_music.library.parser.base.ParserBase(protected_vars: bool)

Bases: object

The base class for all wiki_music.parser subclasses.

Defines the necessary attributes. The uppercased attributes correspond to tag names for easier access.

Warning

This class is not meant to be instantiated, only inherited.

Note

Uppercased properties correspond to TAG names so we can easily use the getattr and setattr methods.
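
A minimal sketch of why matching attribute names to tag names is convenient. The TAGS list and the FakeParser class are assumptions made for this illustration; the real tag list is the one referenced elsewhere in these docs as wiki_music.constants.tags.EXTENDED_TAGS.

```python
from typing import Any, Dict, List

# Hypothetical subset of tag names, used only for this illustration.
TAGS: List[str] = ["ALBUM", "ALBUMARTIST", "DATE"]


class FakeParser:
    """Stand-in for a parser object, not the real ParserBase."""

    def __init__(self) -> None:
        self.ALBUM = ""
        self.ALBUMARTIST = ""
        self.DATE = ""


def load_tags(parser: Any, values: Dict[str, str]) -> None:
    """Copy tag values onto the parser attribute with the same name."""
    for tag in TAGS:
        if tag in values:
            setattr(parser, tag, values[tag])


def dump_tags(parser: Any) -> Dict[str, str]:
    """Collect tag values from the parser by attribute name."""
    return {tag: getattr(parser, tag) for tag in TAGS}
```

Because the attribute name equals the tag name, no per-tag mapping table is needed when reading or writing tags.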

offline_debug

determines if app will run in offline debug mode

Type:bool
write_json

determines if the tracklist will be output in JSON format

Type:bool
multi_threaded

whether to run parts of the code in parallel

Type:bool
_contents

stores the wikipedia page contents

Type:List[str]
_disk_sep

list of track indices separating disks, e.g. if CD 1 = (1, 13) and CD 2 = (14, 20), then _disk_sep = [0, 12, 19]; the offset by one is because indexing starts at zero

Type:List[int]
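
One way to read the _disk_sep example is that every entry after the leading 0 is the zero-based index of the last track on a disk. Under that assumption (it is inferred from the example only, not confirmed by the source), per-track disc numbers could be derived as in this sketch; the function name disc_numbers is hypothetical.

```python
from typing import List


def disc_numbers(disk_sep: List[int]) -> List[int]:
    """Expand separators such as [0, 12, 19] into one disc number per track."""
    numbers: List[int] = []
    # Each separator after the first is taken as the zero-based index
    # of the last track on the corresponding disc.
    for disc, last_index in enumerate(disk_sep[1:], start=1):
        numbers.extend([disc] * (last_index + 1 - len(numbers)))
    return numbers
```

For the example in the description this gives 13 tracks on disc 1 and 7 tracks on disc 2.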
_disks

holds album disks titles

Type:List[list]
genres

list of genres found in wikipedia page

Type:List[str]
_header

tracklist table headers

Type:List[str]
NLTK_names

list of Person Named Entities extracted from the wikipedia page by nltk. See wiki_music.library.parser.process_page.WikipediaParser.NLTK_names for details on how and from which parts of the text the names are extracted

Type:List[str]
_personnel

list holding additional personnel participating on the album

Type:List[str]
_appearences

list corresponding to personnel; for each person it holds the list of tracks that person has appeared on

Type:List[List[int]]
_subtracks

each entry holds list of subtracks for one track

Type:List[List[str]]
_subtypes

each entry holds list of types for each subtrack

Type:List[List[str]]
work_dir

path to the directory with music files; this variable can be protected from resetting in the __init__ method

Type:Path
log

instance of MultiLog which sends messages to logger and GUI

Type:wiki_music.utilities.utils.MultiLog
_sections

dictionary of lists of BeautifulSoup objects; each entry in the dict contains one whole section of the page and is indexed by that section's title

Type:Dict[str, List[Bs4Soup]]
ALBUM

String with album name.

This attribute can be protected from resetting in the __init__ method.

Type:str
ALBUMARTIST

String with band name.

This attribute can be protected from resetting in the __init__ method.

Type:str
ARTIST

Each entry in list holds list of artists for one track.

Type:List[List[str]]
COMPOSER

Each entry in list holds list of composers for one track.

Type:List[List[str]]
COVERART

Holds cover art read into memory as a bytes object.

Type:bytes
DATE

Album release date.

Type:str
DISCNUMBER

List containing the disc number for every track.

Type:List[int]
FILE

List of files on local disk corresponding to each track.

Type:List[str]
GENRE

Holds the album genre selected automatically or by user.

If genres is a list with one item, then GENRE is that item; otherwise user input is required to select from the genres list.

Type:str
LYRICS

List of lyrics corresponding to each track.

Type:List[str]
TITLE

List of track names.

Type:List[str]
TRACKNUMBER

List of numbers for each track.

Type:List[str]
TYPE

List of track types.

Type:List[str]
reinit(protected_vars: bool)

Reinitializes parser variables.

library.parser.extractors

Holds various html formats information extractors.

class wiki_music.library.parser.extractors.DataExtractors

Bases: object

Parse various table formats from wikipedia.

Warning

This class is not meant to be instantiated, only inherited.

classmethod _from_list(table: BeautifulSoup) → List[List[List[str]]]

Extract tracklist formatted as an html list with ‘ol’ and ‘ul’ tags.

Parameters:table (BeautifulSoup) – html list containing the tracklist
Returns:2D array representing table with rows and columns
Return type:List[List[str]]
static _from_table(tables: List[BeautifulSoup]) → List[List[List[str]]]

Extract wikipedia html table composed of ‘td’ and ‘th’ html tags.

Parameters:tables (List[bs4.BeautifulSoup]) – each BeautifulSoup in the list contains one html table
Returns:each element in the list is one parsed table; each table is a 2D array of strings representing rows and columns
Return type:List[List[List[str]]]
static _fuzzy_extract(string: str, choices: List[str], limit: Optional[int] = None) → List[str]

Fuzzy extract names, track types, etc. from brackets behind the track name.

Parameters:
  • string (str) – string to match
  • choices (List[str]) – list of possible choices for string to match
  • limit (int) – max number of extracted choices
Returns:list of extracted choices
Return type:List[str]
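
A rough sketch of fuzzy extraction, under stated assumptions. It uses difflib from the Python standard library, which is a swapped-in technique and not necessarily what wiki_music uses internally. The function name fuzzy_extract and the threshold parameter are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def fuzzy_extract(string: str, choices: List[str],
                  limit: Optional[int] = None,
                  threshold: float = 0.9) -> List[str]:
    """Return choices whose text (almost) fully occurs inside string."""
    haystack = string.lower()
    scored = []
    for choice in choices:
        needle = choice.lower()
        if not needle:
            continue
        matcher = SequenceMatcher(None, needle, haystack, autojunk=False)
        match = matcher.find_longest_match(0, len(needle), 0, len(haystack))
        # Fraction of the choice covered by its longest common substring.
        score = match.size / len(needle)
        if score >= threshold:
            scored.append((score, choice))
    # Best matches first; the sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [choice for _, choice in scored[:limit]]
```

With limit=None the slice keeps every match, which mirrors the optional limit in the documented signature.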

static _get_artist(cell: str) → List[str]

Splits list of artists in tracklist table cell separated by , or &.

Parameters:cell (str) – string containing artists separated by delimiters
Returns:list of artists
Return type:List[str]
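
The splitting described above could look roughly like this. It is a sketch only; the name split_artists is hypothetical and the real method may treat whitespace or other delimiters differently.

```python
import re
from typing import List


def split_artists(cell: str) -> List[str]:
    """Split a table cell on commas and ampersands, dropping empty pieces."""
    parts = re.split(r"\s*[,&]\s*", cell)
    return [part.strip() for part in parts if part.strip()]
```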
static _get_track(cell: str) → Tuple[str, List[str]]

Extract track and subtracks names from table cell.

Parameters:cell (str) – table cell containing track and possibly subtrack names
Returns:first element is track name and second is a list of subtracks
Return type:tuple
static _html2python_list(table: BeautifulSoup) → List[str]

Converts html list to python list.

Html list can be ordered <ol> or unordered <ul> its elements should be separated by <li> tags.

Parameters:table (BeautifulSoup) – html object parsed by bs4
Returns:each element represents one row in the html list
Return type:list
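
To illustrate the conversion without depending on bs4, the sketch below uses html.parser from the standard library instead. This is a deliberate substitution for illustration; the documented method works on BeautifulSoup objects. The names ListItemCollector and html_list_to_python are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class ListItemCollector(HTMLParser):
    """Collect the text of every top-level <li> element into a list."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[str] = []
        self._depth = 0  # how many <li> elements are currently open

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._depth += 1
            if self._depth == 1:
                self.rows.append("")

    def handle_endtag(self, tag):
        if tag == "li" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        # Text outside of any <li> is ignored.
        if self._depth > 0:
            self.rows[-1] += data


def html_list_to_python(html: str) -> List[str]:
    """Return one stripped string per row of an <ol> or <ul> list."""
    collector = ListItemCollector()
    collector.feed(html)
    collector.close()
    return [row.strip() for row in collector.rows]
```

Text of nested lists is merged into the enclosing row in this sketch; whether the real method does the same is not stated in the source.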

library.parser.in_out

Module with parser input-output methods.

class wiki_music.library.parser.in_out.ParserInOut(protected_vars)

Bases: wiki_music.library.parser.base.ParserBase

Encapsulates parser input and output methods.

Class is inherited by wiki_music.library.parser.process_page.WikipediaParser. Takes care of outputting and loading information.

_reassign_files()

Search current working directory and assign files to tracks.

basic_out()

Outputs files in three basic formats.

  1. pickled version of the downloaded wikipedia page
  2. nicely formatted html version of the wikipedia page
  3. plain text version of the wikipedia page
bracketed_types

Takes _bracketed_types list, populates and returns it.

See also

wiki_music.utilities.parser_utils.bracket()
function used to append brackets at the ends of strings in list
Type:List[str]
data_to_dict(indices: List[int]) → SongList

Converts parser data to list of dictionaries.

If json_dump is enabled list is written to file.

Parameters:indices (List[int]) – indices of files to save

See also

wiki_music.constants.tags.EXTENDED_TAGS
list of tags that are written to each dictionary
Returns:each dictionary in list represents tags of one song
Return type:List[Dict[str, Union[str, int, bytes, list]]]
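
A minimal sketch of turning per-tag lists into one dictionary per song. The columns data and the function name to_song_list are made up for this example; the real method reads the tag names from EXTENDED_TAGS and may also write JSON.

```python
from typing import Any, Dict, List

# Hypothetical per-track columns keyed by tag name.
columns: Dict[str, List[Any]] = {
    "TITLE": ["Intro", "Song"],
    "TRACKNUMBER": ["1", "2"],
    "ARTIST": [["A"], ["A", "B"]],
}


def to_song_list(data: Dict[str, List[Any]],
                 indices: List[int]) -> List[Dict[str, Any]]:
    """Build one dictionary of tags for each selected track index."""
    return [{tag: values[i] for tag, values in data.items()}
            for i in indices]
```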
debug_folder

Path to debugging folder.

Type:str
disk_write()

Save tracklist and personnel to disk in plain text format.

files

Gets list of music files in currently set working directory.

Type:List[Path]
personnel_2_str()

Convert album personnel to string to print out or write to disk.

Returns:nicely formatted string representation of personnel
Return type:str
print_tracklist()

Prints tracklist to console.

read_files()

Read tags from files in working directory.

See also

wiki_music.library.tags_io.read_tags()
function that handles tag reading
save_lyrics(find: bool = True)

Calls lyricsfinder to search for and save lyrics for all tracks.

Parameters:find (bool) – if False lyrics list is initialized only with empty strings

See also

wiki_music.library.lyrics.save_lyrics()
function that handles lyrics finding and saving
tracklist_2_str(to_file=True) → list

Convert tracklist to string to print out or write to disk.

Parameters:to_file (bool) – if False and the tracklist is to be printed to console, highlight headers to make tracklist more readable
Returns:nicely formatted string representation of tracklist
Return type:str
write_tags(indices: List[int]) → bool

Write tags to corresponding files. Writing is done in parallel.

Parameters:indices (List[int]) – indices of files to save

See also

wiki_music.library.tags_io.write_tags()
function that handles tag writing
data_to_dict()
this method prepares tags data in suitable format for writing
wiki_music.utilities.parser_utils.ThreadPool()
class that handles parallelism
Returns:True if writing was successful
Return type:bool
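
Parallel writing could be sketched as below. The example uses concurrent.futures.ThreadPoolExecutor from the standard library as a stand-in for wiki_music's own ThreadPool class, whose interface is not shown in this document. The names write_all and write_one are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def write_all(songs: List[Dict[str, str]],
              write_one: Callable[[Dict[str, str]], bool]) -> bool:
    """Run write_one for every song in worker threads.

    Returns True only if every individual write reported success.
    """
    if not songs:
        return True
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(write_one, songs))
    return all(results)
```

Tag writing is mostly disk-bound, which is why threads can help here despite the interpreter lock.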

library.parser.preload

Module storing the class that takes care of downloading the wikipedia page.

class wiki_music.library.parser.preload.CircularDict(maxlen=typing.Union[int, NoneType])

Bases: collections.OrderedDict

Circular dict-like indexable stack with limited capacity.

If a maximum length is specified and the stack has reached its capacity, the oldest item is discarded after a new one is added.

Parameters:maxlen (Optional[int]) – maximum stack capacity
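
A bounded dictionary of this kind can be sketched as follows. This is an illustration under assumptions, not the actual CircularDict code; the class name BoundedDict is hypothetical.

```python
from collections import OrderedDict
from typing import Optional


class BoundedDict(OrderedDict):
    """Ordered dict that evicts its oldest entry once maxlen is exceeded."""

    def __init__(self, maxlen: Optional[int] = None) -> None:
        super().__init__()
        self.maxlen = maxlen

    def __setitem__(self, key, value) -> None:
        super().__setitem__(key, value)
        if self.maxlen is not None:
            while len(self) > self.maxlen:
                # last=False pops the item that was inserted first.
                self.popitem(last=False)
```

With maxlen=None the class behaves like a plain OrderedDict and never discards anything.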
class wiki_music.library.parser.preload.Preload(album: str, band: str, offline_debug: bool)

Bases: object

Controls the preload of the wikipedia page.

It is fully self-contained and exposes only start, stop and pause methods. It aborts automatically when no album or band is specified. After the preload is complete, the results are available in results.

message

caches preload progress messages

Type:Queue
_check_band

Check if artist from input is the same as the one on wikipedia page.

If the artist is not the same, issues a warning about the mismatch and asks the user whether to continue.

See also

terminate()
method that takes care of ending the app execution
_cook_soup() → Optional[str]

Parse downloaded wikipedia page with bs4 to BeautifulSoup object.

Then splits the page to dictionary of sections, where each section is indexed by its name.

Returns:if some error occurred, returns a string with its description
Return type:Optional[str]
_from_disk() → Optional[str]

Load wikipedia page from pickle file on disk.

Returns:if some error occurred, returns a string with its description
Return type:Optional[str]
_from_web() → Optional[str]

Guesses the right wikipedia page from input and downloads it.

Returns:if some error occurred, returns a string with its description
Return type:Optional[str]
_preload_run()

Organizes the preload thread and calls other methods.

Based on input decides how to load and parse the wikipedia page.

See also

_from_web()
method to retrieve wikipedia page from internet
_from_disk()
method to retrieve pickled wikipedia page from disk
_cook_soup()
method to parse the page
pause()

Pause the preload thread.

results

Returns downloaded and preprocessed wikipedia page.

Waits until results are available, only then returns.

Returns:
  • WikipediaPage – wikipedia page object
  • bs4.BeautifulSoup – html parsed tree
  • Dict[List[bs4.element.Tags]] – sections of the page split into dict indexed by section names
  • Union[str, Path] – url address of the page or Path to offline pickle file
  • str – error string, or None if no exceptions occurred
stop()

Method that stops currently running preload.

unpause()

Unpause the preload thread.
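
The start, pause, unpause and stop controls, together with a results property that waits for completion, can be illustrated with threading.Event objects. This is a generic sketch and not the Preload implementation; the class PausableWorker and its summing workload are invented for the example.

```python
import threading
from typing import Optional


class PausableWorker:
    """Background worker with start, pause, unpause and stop controls."""

    def __init__(self, steps: int) -> None:
        self._steps = steps
        self._running = threading.Event()   # cleared while paused
        self._stopped = threading.Event()   # set to request termination
        self._done = threading.Event()      # set once a result is ready
        self._result: Optional[int] = None
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._running.set()

    def _run(self) -> None:
        total = 0
        for step in range(self._steps):
            self._running.wait()            # blocks here while paused
            if self._stopped.is_set():
                break
            total += step
        self._result = total
        self._done.set()

    def start(self) -> None:
        self._thread.start()

    def pause(self) -> None:
        self._running.clear()

    def unpause(self) -> None:
        self._running.set()

    def stop(self) -> None:
        self._stopped.set()
        self._running.set()                 # wake the worker so it can exit

    @property
    def results(self) -> Optional[int]:
        self._done.wait()                   # wait until results are available
        return self._result
```

Note that reading results on a paused worker blocks until it is unpaused or stopped, which matches the "waits until results are available" behaviour described above.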

class wiki_music.library.parser.preload.WikiCooker(protected_vars: bool)

Bases: wiki_music.library.parser.base.ParserBase

Downloads the wikipedia page and converts it to a WikipediaPage object.

Subsequently, important parts of the page are extracted to class attributes. The class can run the preload of the page in a background thread while the user types input values in the GUI.

Warning

This class is not meant to be instantiated, only inherited.

References

https://www.crummy.com/software/BeautifulSoup/bs4/doc/: used to parse the wikipedia page
https://pypi.org/project/wikipedia/: used to get the wikipedia page

Parameters:protected_vars (bool) – defines if certain variables should be initialized by __init__ method or not
_page

downloaded page to be parsed by BeautifulSoup

Type:wikipedia.WikipediaPage
_soup

BeautifulSoup object representing the whole page

Type:bs4.BeautifulSoup
_sections

the _soup split to page sections indexed by their titles

Type:Dict[str, List[“Tag”]]
_url

the page url or path to pickle file for offline debug

Type:Union[“Path”, str]
_get_preload_progress() → Generator[str, None, None]

Generator, outputs preload progress messages.

Yields:str – progress messages of preload instance
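
A generator that relays queued progress messages might look like this sketch. It assumes the messages sit in a queue.Queue, as the message attribute of Preload suggests; the function name drain_messages is hypothetical.

```python
import queue
from typing import Generator


def drain_messages(messages: "queue.Queue[str]") -> Generator[str, None, None]:
    """Yield every message currently waiting in the queue without blocking."""
    while True:
        try:
            yield messages.get_nowait()
        except queue.Empty:
            return
```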
_preload_id

Get unique id for each preload instance.

The id is a three-tuple of album, band and offline_debug flag.

Type:Tuple[str, str, str]
get_wiki() → Optional[str]

Wait until preload is finished, then return downloaded data.

Returns:error string if some error occurred
Return type:Optional[str]
start_preload()

Starts preload instance and caches its reference under unique id.

Other running preloads are paused. The maximum number of preloads is 10: 1 running and 9 paused. If a new preload is added to the cache, the oldest one is destroyed.

stop_preload()

Stops all running preloads and deletes their references from the cache.

terminate(message: str)

Send a message to the GUI to ask the user whether they wish to terminate the app.

If the answer is yes, then the parser is destroyed and the GUI terminated.

See also

wiki_music.gui_lib.main_window.Checkers._exception_check()
this method handles displaying the message to user
wiki_music.utilities.sync.Control
serves to pass message from parser to GUI
Parameters:message (str) – message to show user when asking if app should terminate

library.parser.process_page

Module containing the whole parser with all the inherited subclasses.

Class WikipediaParser has complete functionality but its methods need to be called in the correct order to give sensible results.

class wiki_music.library.parser.process_page.WikipediaParser(protected_vars: bool = True, GUI: bool = False, multi_threaded: bool = True)

Bases: wiki_music.library.parser.extractors.DataExtractors, wiki_music.library.parser.preload.WikiCooker, wiki_music.library.parser.in_out.ParserInOut

Class for parsing the wikipedia page and extracting tags data from it.

Warning

Most parser methods are designed to fail gracefully so the extraction can proceed even when some subset of it fails. This has a dark side because it hides errors! All warning-decorated methods are resilient to any exception defined in wiki_music.utilities.exceptions.
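
The fail-gracefully pattern described in this warning can be sketched with a decorator. The actual warning decorator in wiki_music is not shown in this document, so everything below (ParserError, the logger name, get_genres_sketch) is a hypothetical stand-in.

```python
import functools
import logging
from typing import Any, Callable, Optional

log = logging.getLogger("parser_sketch")


class ParserError(Exception):
    """Hypothetical base class standing in for the wiki_music exceptions."""


def warning(func: Callable[..., Any]) -> Callable[..., Optional[Any]]:
    """Log ParserError raised by func instead of propagating it."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Optional[Any]:
        try:
            return func(*args, **kwargs)
        except ParserError as exc:
            # The failure is only logged, so callers see None, not an error.
            log.warning("%s failed: %s", func.__name__, exc)
            return None

    return wrapper


@warning
def get_genres_sketch(found: list) -> list:
    if not found:
        raise ParserError("no genres")
    return found
```

The sketch also shows the downside mentioned above: a caller that ignores the log cannot tell a failed extraction from an empty result.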

References

https://www.crummy.com/software/BeautifulSoup/bs4/doc/: used to parse the wikipedia page

Parameters:
  • protected_vars (bool) – defines if certain variables should be initialized by __init__ method or not
  • GUI (bool) – if True, assume the app is running in GUI mode; if False, assume the app is running in CLI mode
  • multi_threaded (bool) – whether to run some parts of code in threads
NLTK_names

Use nltk to extract person names from sections of wikipedia page.

See also

wiki_music.constants.parser_const.PERSONNEL_SECTIONS
these sections of the page + track_listing are passed to nltk to look for names
Raises:wiki_music.utilities.exceptions.NoNames2ExtractException – when the page doesn’t have any of the sections defined in the See also section, so no text can be provided to nltk. It makes no sense to try extraction from other parts of the page as they are too cluttered by hardly classifiable information.
_complete()

Recursively complete information in parser lists.

Traverses _composers, artists and _personnel and checks each name against every other; if a name is found to be incomplete, it is replaced by the longer version from another list.
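
The completion idea can be illustrated with a simple substring rule. Treating a name as incomplete when it is contained in a longer known name is an assumption made for this sketch; the real matching logic is not documented here. The function name complete_names is hypothetical.

```python
from typing import List


def complete_names(names: List[str], known: List[str]) -> List[str]:
    """Replace each name by the longest known name that contains it."""
    completed = []
    for name in names:
        candidates = [k for k in known
                      if name and name.lower() in k.lower()]
        # Keep the original when nothing longer is known.
        completed.append(max(candidates + [name], key=len))
    return completed
```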

_info_tracks()

Parse track names for additional information.

Such as artist, composer, type, etc. Also gets rid of useless strings like bonus track, featuring, etc. This information is assumed to be enclosed in brackets behind the track name.
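
Separating the bracketed information from the track name could be done roughly like this. The regular expression and the function name split_bracket_info are assumptions for illustration; nested brackets are not handled.

```python
import re
from typing import List, Tuple

# Matches one non-nested "(...)" group plus any whitespace before it.
BRACKETS = re.compile(r"\s*\(([^()]*)\)")


def split_bracket_info(title: str) -> Tuple[str, List[str]]:
    """Return the bare track name and the text of each bracketed group."""
    info = [group.strip() for group in BRACKETS.findall(title)]
    name = BRACKETS.sub("", title).strip()
    return name, info
```

Each extracted group could then be classified as artist, composer or type, or dropped if it is a useless string such as bonus track.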

_merge_artist_personnel()

Assigns personnel to track artists.

The assignment is done based on _appearences, which specifies the tracks for each person in personnel.
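
A sketch of the assignment, assuming that appearances hold zero-based track indices (the source does not say whether they are zero- or one-based). The function name merge_personnel is hypothetical.

```python
from typing import List


def merge_personnel(artists: List[List[str]], personnel: List[str],
                    appearances: List[List[int]]) -> List[List[str]]:
    """Add each person to the artists of the tracks they appeared on."""
    merged = [list(track) for track in artists]
    for person, tracks in zip(personnel, appearances):
        for index in tracks:
            # Assumes zero-based track indices; skip out-of-range entries.
            if 0 <= index < len(merged) and person not in merged[index]:
                merged[index].append(person)
    return merged
```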

_process_tracks(data: List[List[List[str]]]) → Tuple[List[str], List[List[str]]]

Process raw extracted list of album CD tracklists for track details.

Parameters:data (List[List[List[str]]]) – list of tracklists, one for each CD; each tracklist consists of a list of rows, and each row is a list of cells
Returns:list of tracks and for each track a list of artists
Return type:Tuple[List[str], List[List[str]]]
get_composers() → List[List[str]]

Extract composers from wikipedia page.

Employs complex logic. First, Person named entities are extracted by nltk and merged with composers. This list of names is then used to try to guess composers and corresponding tracks from the short text above the table.

See also

get_personnel()
this method should run first because it populates the _personnel used by this method

Warning

This method is not as robust as it should be. It fails for many types of formatting.

Returns:list of composers for every track
Return type:List[List[str]]
get_contents() → List[str]

Extract page contents from keys in _sections dictionary.

Raises:wiki_music.utilities.exceptions.NoContentsException – if no contents were retrieved
Returns:page contents as a list
Return type:List[str]
get_cover_art(in_thread: bool = False) → Optional[bytes]

Get album cover art.

Extracts from information box in the top right corner of wikipedia page. For app use it runs in a separate thread because the cover art data is not used by parser in any way, so it can be downloaded in the background. Populates COVERART

Parameters:in_thread (bool) – if false, doesn’t run in thread and blocks until cover art is found
Raises:wiki_music.utilities.exceptions.NoCoverArtException – if cover art url could not be found
get_genres() → List[str]

Get list of album genres.

Extracts from information box in the top right corner of wikipedia page. If only one genre is found, its value is assigned to GENRE.

Raises:wiki_music.utilities.exceptions.NoGenreException – if no genres could be extracted from page
Returns:list of found genres
Return type:List[str]
get_personnel() → Tuple[List[str], List[List[int]]]

Extract personnel from wikipedia page.

Extraction is done from the following sections: wiki_music.constants.parser_const.PERSONNEL_SECTIONS. These entries are then parsed for additional data like appearances on tracks.

Raises:wiki_music.utilities.exceptions.NoPersonnelException – if no table or list with personnel was found
Returns:two lists, first contains found personnel and the second has for each person list of tracks on which the person appeared
Return type:Tuple[List[str], List[List[int]]]
get_release_date() → str

Get album release date.

Extracts from information box in the top right corner of wikipedia page. Populates DATE.

Raises:utilities.exceptions.NoReleaseDateException – raised if no release date was extracted
Returns:release year as a string
Return type:str
get_tracks() → Tuple[List[str], List[List[str]]]

Attempt to extract tracklist from html table or list on wikipedia.

See also

_from_table()
used to parse tracklist in html table
_from_list()
used to parse tracklist in html list
_process_tracks()
method called to parse raw extracted table and get song numbers, artists, composers …
Raises:wiki_music.utilities.exceptions.NoTracklistException – raised if no tracklist in any format was found
Returns:list of tracks and for each track a list of artists
Return type:Tuple[List[str], List[List[str]]]
merge_artist_composers()

Move all artists to composers list.

This is done or left out based on user input.