wiki_music.library.parser module¶

Warning

Documentation is still under construction; some things might not be up to date.

library.parser.WikipediaRunner¶

Toplevel Wikipedia Parser class.

Inherits all other parser subclasses. This is the class that is intended for user interaction. Its methods know how to run the parser in order to produce meaningful results.

Warning

This is the only parser class that is meant for user interaction. Calling its subclasses directly might result in unexpected behaviour.

param albumartist:	band name
type albumartist:	str
param album:	album name
type album:	str
param work_dir:	directory with music files
type work_dir:	str
param with_log:	whether the parser should output its progress to the logger, only for CLI mode
type with_log:	bool
param GUI:	if True, assume the app is running in GUI mode; if False, assume it is running in CLI mode
type GUI:	bool
param protected_vars:	whether to initialize protected variables or not
type protected_vars:	bool
param multi_threaded:	whether to run some parts of the code in threads
type multi_threaded:	bool

wiki_music.library.parser.WikipediaRunner.ALBUM¶

String with album name.

This attribute can be protected from resetting in the __init__ method.

Type:	str

wiki_music.library.parser.WikipediaRunner.ALBUMARTIST¶

String with band name.

This attribute can be protected from resetting in the __init__ method.

Type:	str

wiki_music.library.parser.WikipediaRunner.ARTIST¶

Each entry in list holds list of artists for one track.

Type:	List[List[str]]

wiki_music.library.parser.WikipediaRunner.COMPOSER¶

Each entry in list holds list of composers for one track.

Type:	List[List[str]]

wiki_music.library.parser.WikipediaRunner.COVERART¶

Holds cover art read into memory as a bytes object.

Type:	bytes

wiki_music.library.parser.WikipediaRunner.DATE¶

Album release date.

Type:	str

wiki_music.library.parser.WikipediaRunner.DISCNUMBER¶

List containing the disc number for every track.

Type:	List[int]

wiki_music.library.parser.WikipediaRunner.FILE¶

List of files on local disk corresponding to each track.

Type:	List[str]

wiki_music.library.parser.WikipediaRunner.GENRE¶

Holds the album genre selected automatically or by user.

If genres is a list with one item, then it is that item; otherwise user input is required to select from the genres list.

Type:	str

wiki_music.library.parser.WikipediaRunner.LYRICS¶

List of lyrics corresponding to each track.

Type:	List[str]

wiki_music.library.parser.WikipediaRunner.NLTK_names¶

Use nltk to extract person names from sections of wikipedia page.

See also

wiki_music.constants.parser_const.PERSONNEL_SECTIONS: these sections of the page + track_listing are passed to nltk to look for names

Raises:	`wiki_music.utilities.exceptions.NoNames2ExtractException` – when the page doesn’t have any of the sections defined in the See also section, so no text can be provided to nltk. It makes no sense to try extraction from other parts of the page as they are too cluttered with information that is hard to classify.

wiki_music.library.parser.WikipediaRunner.TITLE¶

List of track names.

Type:	List[str]

wiki_music.library.parser.WikipediaRunner.TRACKNUMBER¶

List of numbers for each track.

Type:	List[str]

wiki_music.library.parser.WikipediaRunner.TYPE¶

List of track types.

Type:	List[str]

wiki_music.library.parser.WikipediaRunner._preload_id¶

Get unique id for each preload instance.

The id is a three-tuple of album, band and the offline_debug flag.

Type:	Tuple[str, str, str]

wiki_music.library.parser.WikipediaRunner.bracketed_types¶

Takes _bracketed_types list, populates and returns it.

See also

wiki_music.utilities.parser_utils.bracket(): function used to append brackets at the ends of strings in list

type:	List[str]

wiki_music.library.parser.WikipediaRunner.debug_folder¶

Path to debugging folder.

Type:	str

wiki_music.library.parser.WikipediaRunner.files¶

Gets list of music files in currently set working directory.

Type:	List[Path]

See also

_reassign_files()

library.parser.base¶

Base module for all parser classes.

class wiki_music.library.parser.base.ParserBase(protected_vars: bool)¶

Bases: object

The base class for all wiki_music.parser subclasses.

Defines the necessary attributes. The uppercased attributes correspond to tag names for easier access.

Warning

This class is not meant to be instantiated, only inherited.

Note

Uppercased properties correspond to TAG names so we can easily use getattr and setattr methods.
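
A minimal sketch of why matching attribute names to tag names is convenient. The class `TagHolder` and the tuple `TAGS` are hypothetical stand-ins introduced for illustration; they are not part of wiki_music.

```python
# Hypothetical stand-in for a parser class; not the wiki_music implementation.
TAGS = ("ALBUM", "ALBUMARTIST", "DATE")  # assumed subset of tag names


class TagHolder:
    def __init__(self) -> None:
        # One attribute per tag, named exactly like the tag.
        for tag in TAGS:
            setattr(self, tag, "")


def to_dict(holder: TagHolder) -> dict:
    # Because attribute names equal tag names, one loop reads every tag.
    return {tag: getattr(holder, tag) for tag in TAGS}


holder = TagHolder()
setattr(holder, "ALBUM", "Example Album")
tags = to_dict(holder)
```

With this naming scheme, code that reads or writes tags does not need a separate mapping from tag names to attribute names.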

offline_debug¶

determines if app will run in offline debug mode

Type:	bool

write_json¶

determines if the tracklist will be output in JSON format

Type:	bool

multi_threaded¶

whether to run parts of the code in parallel

Type:	bool

_contents¶

stores the wikipedia page contents

Type:	List[str]

_disk_sep¶

list of track indices separating disks, e.g. if CD 1 = (1, 13) and CD 2 = (14, 20), then _disk_sep = [0, 12, 19]; the offset by one is because indexing starts at zero

Type:	List[int]
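
The arithmetic in the _disk_sep example can be sketched as follows. Both helper functions are hypothetical and only reproduce the convention described above; how the library itself builds or consumes the list is an assumption.

```python
from typing import List, Tuple


def disk_separators(disks: List[Tuple[int, int]]) -> List[int]:
    """Return zero-based separator indices for 1-based (first, last) track ranges."""
    if not disks:
        return []
    # Start with the zero-based index of the first track of the first disk ...
    separators = [disks[0][0] - 1]
    # ... then append the zero-based index of the last track of every disk.
    separators.extend(last - 1 for _, last in disks)
    return separators


def disc_numbers(separators: List[int]) -> List[int]:
    """Expand separators into one disc number per track."""
    numbers: List[int] = []
    for disc, last in enumerate(separators[1:], start=1):
        # Tracks up to and including index `last` belong to this disc.
        numbers.extend([disc] * (last + 1 - len(numbers)))
    return numbers


seps = disk_separators([(1, 13), (14, 20)])
discs = disc_numbers(seps)
```

For the two CDs from the example, `seps` matches the list given above and `discs` holds one disc number for each of the 20 tracks.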

_disks¶

holds album disks titles

Type:	List[list]

genres¶

list of genres found in wikipedia page

Type:	List[str]

_header¶

tracklist table headers

Type:	List[str]

NLTK_names¶

list of Person Named Entities extracted from wikipedia page by nltk. See wiki_music.library.parser.process_page.WikipediaParser.NLTK_names for details on how and from which parts of the text the names are extracted

Type:	List[str]

_personnel¶

list holding additional personnel participating on the album

Type:	List[str]

_appearences¶

list corresponding to personnel, holding for each person the list of tracks that the person has appeared on

Type:	List[List[int]]

_subtracks¶

each entry holds list of subtracks for one track

Type:	List[List[str]]

_subtypes¶

each entry holds list of types for each subtrack

Type:	List[List[str]]

work_dir¶

string with path to directory with music files; this variable can be protected from resetting in the __init__ method

Type:	Path

log¶

instance of MultiLog which sends messages to logger and GUI

Type:	`wiki_music.utilities.utils.MultiLog`

_sections¶

dictionary of lists of BeautifulSoup objects; each entry in the dict contains one whole section of the page and is indexed by that section title

Type:	Dict[str, List[Bs4Soup]]

ALBUM¶

String with album name.

This attribute can be protected from resetting in the __init__ method.

Type:	str

ALBUMARTIST¶

String with band name.

This attribute can be protected from resetting in the __init__ method.

Type:	str

ARTIST¶

Each entry in list holds list of artists for one track.

Type:	List[List[str]]

COMPOSER¶

Each entry in list holds list of composers for one track.

Type:	List[List[str]]

COVERART¶

Holds cover art read into memory as a bytes object.

Type:	bytes

DATE¶

Album release date.

Type:	str

DISCNUMBER¶

List containing the disc number for every track.

Type:	List[int]

FILE¶

List of files on local disk corresponding to each track.

Type:	List[str]

GENRE¶

Holds the album genre selected automatically or by user.

If genres is a list with one item, then it is that item; otherwise user input is required to select from the genres list.

Type:	str

LYRICS¶

List of lyrics corresponding to each track.

Type:	List[str]

TITLE¶

List of track names.

Type:	List[str]

TRACKNUMBER¶

List of numbers for each track.

Type:	List[str]

TYPE¶

List of track types.

Type:	List[str]

reinit(protected_vars: bool)¶: Reinitializes parser variables.

library.parser.extractors¶

Holds various html formats information extractors.

class wiki_music.library.parser.extractors.DataExtractors¶

Bases: object

Parse various table formats from wikipedia.

Warning

This class is not meant to be instantiated, only inherited.

classmethod _from_list(table: BeautifulSoup) → List[List[List[str]]]¶

Extract tracklist formatted as an html list with ‘ol’ and ‘ul’ tags.

See also

_html2python_list()

Parameters:	table (BeautifulSoup) – html list containing the tracklist
Returns:	2D array representing table with rows and columns
Return type:	List[List[str]]

static _from_table(tables: List[BeautifulSoup]) → List[List[List[str]]]¶

Extract wikipedia html table composed of ‘td’ and ‘th’ html tags.

Parameters:	tables (List[bs4.BeautifulSoup]) – each BeautifulSoup in list contains one html table
Returns:	each element in list is one parsed table, each table is a 2D array of strings representing rows and columns
Return type:	List[List[List[str]]]

static _fuzzy_extract(string: str, choices: List[str], limit: Optional[int] = None) → List[str]¶

Fuzzy extract names, track types, etc. from brackets behind track name.

Parameters:	string (str) – string to match
	choices (List[str]) – list of possible choices for string to match
	limit (int) – max number of extracted choices
Returns:	list of extracted choices
Return type:	List[str]
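
A rough sketch of fuzzy extraction with the same argument names. It uses `difflib.SequenceMatcher` from the standard library as a stand-in, since the matching backend used by the library is not stated here; the function name `fuzzy_extract` and the `cutoff` parameter are assumptions made for this example.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def fuzzy_extract(string: str, choices: List[str],
                  limit: Optional[int] = None,
                  cutoff: float = 0.8) -> List[str]:
    """Return choices that approximately occur in `string`, best match first."""
    words = string.lower().split()
    scored = []
    for choice in choices:
        n = len(choice.split())
        # Compare the choice against every window with the same word count.
        windows = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
        best = max((SequenceMatcher(None, choice.lower(), w).ratio()
                    for w in windows), default=0.0)
        if best >= cutoff:
            scored.append((best, choice))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Slicing with limit=None keeps every match.
    return [choice for _, choice in scored][:limit]


found = fuzzy_extract("feat. Jon Smith and Mary Jones",
                      ["John Smith", "Mary Jones", "Peter Pan"])
```

The misspelled "Jon Smith" still matches "John Smith" because its similarity ratio stays above the cutoff, while "Peter Pan" is rejected.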

static _get_artist(cell: str) → List[str]¶

Splits list of artists in tracklist table cell separated by , or &.

Parameters:	cell (str) – string containing artists separated by delimiters
Returns:	list of artists
Return type:	List[str]
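
Splitting on the two delimiters named above can be sketched with a regular expression. The function name `split_artists` is hypothetical, and the real method may handle more cases than this.

```python
import re
from typing import List


def split_artists(cell: str) -> List[str]:
    """Split a table cell on ',' or '&' and strip surrounding whitespace."""
    parts = re.split(r"[,&]", cell)
    # Drop empty fragments produced by trailing or doubled delimiters.
    return [part.strip() for part in parts if part.strip()]


artists = split_artists("Alice Example, Bob Sample & Carol Test")
```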

static _get_track(cell: str) → Tuple[str, List[str]]¶

Extract track and subtracks names from table cell.

Parameters:	cell (str) – table cell containing track and possibly subtrack names
Returns:	first element is track name and second is a list of subtracks
Return type:	tuple

static _html2python_list(table: BeautifulSoup) → List[str]¶

Converts html list to python list.

The html list can be ordered <ol> or unordered <ul>; its elements should be separated by <li> tags.

Parameters:	table (BeautifulSoup) – html object parsed by bs4
Returns:	each element represents one row in html list
Return type:	list
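
The conversion can be illustrated without bs4. The sketch below swaps in `html.parser` from the standard library so that it runs without dependencies; the class `ListItemCollector` and the function `html_to_list` are hypothetical, and the code assumes every <li> tag is explicitly closed.

```python
from html.parser import HTMLParser
from typing import List


class ListItemCollector(HTMLParser):
    """Collect the text of every top-level <li> element, one string per item."""

    def __init__(self) -> None:
        super().__init__()
        self.items: List[str] = []
        self._depth = 0  # > 0 while inside an <li> element

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._depth += 1
            if self._depth == 1:
                self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        # Text inside nested tags such as <i> is appended to the current item.
        if self._depth > 0:
            self.items[-1] += data


def html_to_list(html: str) -> List[str]:
    collector = ListItemCollector()
    collector.feed(html)
    collector.close()
    return [item.strip() for item in collector.items]


rows = html_to_list('<ol><li>"Intro" - 1:05</li><li>"Song <i>Two</i>" - 3:40</li></ol>')
```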

library.parser.in_out¶

Module with parser input-output methods.

class wiki_music.library.parser.in_out.ParserInOut(protected_vars)¶

Bases: wiki_music.library.parser.base.ParserBase

Encapsulates parser input and output methods.

Class is inherited by wiki_music.library.parser.process_page.WikipediaParser. Takes care of outputting and loading information.

_reassign_files()¶: Search current working directory and assign files to tracks.

basic_out()¶

Outputs files in three basic formats.

pickled version of the downloaded wikipedia page
nicely formatted html version of the wikipedia page
plain text version of the wikipedia page

bracketed_types¶

Takes _bracketed_types list, populates and returns it.

See also

wiki_music.utilities.parser_utils.bracket(): function used to append brackets at the ends of strings in list

type:	List[str]

data_to_dict(indices: List[int]) → SongList¶

Converts parser data to list of dictionaries.

If json_dump is enabled list is written to file.

Parameters:	indices (List[int]) – indices of files to save

See also

wiki_music.constants.tags.EXTENDED_TAGS: list of tags that are written to each dictionary

Returns:	each dictionary in list represents tags of one song
Return type:	List[Dict[str, Union[str, int, bytes, list]]]
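
A minimal sketch of turning parallel per-track lists into one dictionary per song. The tuple `EXAMPLE_TAGS`, the function `data_to_dicts` and the sample data are assumptions made for illustration; the real tag list lives in wiki_music.constants.tags.EXTENDED_TAGS.

```python
from typing import Any, Dict, List

# Assumed subset of tags used only for this example.
EXAMPLE_TAGS = ("TITLE", "TRACKNUMBER", "ARTIST", "ALBUM")


def data_to_dicts(data: Dict[str, Any], indices: List[int]) -> List[Dict[str, Any]]:
    """Build one tag dictionary per selected track."""
    songs = []
    for i in indices:
        song = {}
        for tag in EXAMPLE_TAGS:
            value = data[tag]
            # Per-track tags are lists indexed by track; album-wide tags are scalars.
            song[tag] = value[i] if isinstance(value, list) else value
        songs.append(song)
    return songs


data = {
    "TITLE": ["Intro", "Second Song"],
    "TRACKNUMBER": ["1", "2"],
    "ARTIST": [["Alice Example"], ["Bob Sample"]],
    "ALBUM": "Example Album",
}
songs = data_to_dicts(data, [1])
```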

debug_folder¶

Path to debugging folder.

Type:	str

disk_write()¶: Save tracklist and personnel to disk in plain text format.

files¶

Gets list of music files in currently set working directory.

Type:	List[Path]

See also

_reassign_files()

personnel_2_str()¶

Convert album personnel to string to print out or write to disk.

Returns:	nicely formatted string representation of personnel
Return type:	str

print_tracklist()¶: Prints tracklist to console.

See also

tracklist_2_str()

read_files()¶

Read tags from files in working directory.

See also

wiki_music.library.tags_io.read_tags(): function that handles tag reading

save_lyrics(find: bool = True)¶

Calls lyricsfinder to search for and save lyrics for all tracks.

Parameters:	find (bool) – if False lyrics list is initialized only with empty strings

See also

wiki_music.library.lyrics.save_lyrics(): function that handles lyrics finding and saving

tracklist_2_str(to_file=True) → list¶

Convert tracklist to string to print out or write to disk.

Parameters:	to_file (bool) – if False and the tracklist is to be printed to console, highlight headers to make tracklist more readable
Returns:	nicely formatted string representation of tracklist
Return type:	str

write_tags(indices: List[int]) → bool¶

Write tags to corresponding files. Writing is done in parallel.

Parameters:	indices (List[int]) – indices of files to save

See also

wiki_music.library.tags_io.write_tags(): function that handles tag writing
data_to_dict(): this method prepares tags data in suitable format for writing
wiki_music.utilities.parser_utils.ThreadPool(): class that handles parallelism

Returns:	True if writing was successful
Return type:	bool

library.parser.preload¶

Module storing class that takes care of downloading wikipedia page.

class wiki_music.library.parser.preload.CircularDict(maxlen=typing.Union[int, NoneType])¶

Bases: collections.OrderedDict

Circular dict-like indexable stack with limited capacity.

If a maximum length is specified, the oldest item is discarded when a new one is added to a stack that has reached its maximum capacity.

Parameters:	maxlen (Optional[int]) – maximum stack capacity
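
The eviction behaviour described above can be sketched on top of `collections.OrderedDict`. The class `BoundedDict` is a hypothetical illustration and not the library's CircularDict; it covers only insertion and eviction.

```python
from collections import OrderedDict
from typing import Optional


class BoundedDict(OrderedDict):
    """Ordered dict that evicts its oldest entry once `maxlen` is exceeded."""

    def __init__(self, maxlen: Optional[int] = None) -> None:
        super().__init__()
        self.maxlen = maxlen

    def __setitem__(self, key, value) -> None:
        super().__setitem__(key, value)
        if self.maxlen is not None:
            while len(self) > self.maxlen:
                # last=False pops the item that was inserted first.
                self.popitem(last=False)


cache = BoundedDict(maxlen=2)
cache["a"] = 1
cache["b"] = 2
cache["c"] = 3  # "a" is discarded here
```

With `maxlen=None` the sketch behaves like a plain ordered dictionary and never discards anything.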

class wiki_music.library.parser.preload.Preload(album: str, band: str, offline_debug: bool)¶

Bases: object

Controls the preload of wikipedia page.

It is totally self-contained and exposes only start, stop and pause methods. Aborts automatically when no album or band is specified. After the preload is complete, results are available in results.

message¶

caches preload progress messages

Type:	`Queue`

_check_band¶

Check if artist from input is the same as the one on wikipedia page.

If the artist is not the same, issues a warning about the mismatch and asks the user whether to continue.

See also

terminate(): method that takes care of ending the app execution

_cook_soup() → Optional[str]¶

Parse downloaded wikipedia page with bs4 to BeautifulSoup object.

Then splits the page to dictionary of sections, where each section is indexed by its name.

Returns:	if some error occurred return string with its description
Return type:	Optional[str]

_from_disk() → Optional[str]¶

Load wikipedia page from pickle file on disk.

Returns:	if some error occurred return string with its description
Return type:	Optional[str]

_from_web() → Optional[str]¶

Guesses the right wikipedia page from input and downloads it.

Returns:	if some error occurred return string with its description
Return type:	Optional[str]

_preload_run()¶

Organizes the preload thread and calls other methods.

Based on input decides how to load and parse the wikipedia page.

See also

_from_web(): method to retrieve wikipedia page from internet
_from_disk(): method to retrieve pickled wikipedia page from disk
_cook_soup(): method to parse the page

pause()¶: Pause the preload thread.

results¶

Returns downloaded and preprocessed wikipedia page.

Waits until results are available, only then returns.

Returns:	WikipediaPage – wikipedia page object
	bs4.BeautifulSoup – html parsed tree
	Dict[List[bs4.element.Tags]] – sections of the page split into dict indexed by section names
	Union[str, Path] – url address of the page or Path to offline pickle file
	str – error string, or None if no exceptions occurred
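
The "wait until results are available" behaviour can be sketched with a worker thread and a `threading.Event`. The class `BackgroundLoader` is a hypothetical, much reduced stand-in for Preload: it has no pause or stop support and returns only a value and an error string.

```python
import threading
from typing import Optional


class BackgroundLoader:
    """Run a loading function in a thread; `results` blocks until it finishes."""

    def __init__(self, load) -> None:
        self._load = load
        self._done = threading.Event()
        self._value = None
        self._error: Optional[str] = None

    def start(self) -> None:
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self) -> None:
        try:
            self._value = self._load()
        except Exception as exc:  # report the failure as a string, not a raise
            self._error = str(exc)
        finally:
            self._done.set()

    @property
    def results(self):
        self._done.wait()  # block until the worker has finished
        return self._value, self._error


loader = BackgroundLoader(lambda: "<html>page</html>")
loader.start()
page, error = loader.results
```

Returning the error as a string mirrors the Optional[str] convention used by the loading methods documented here.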

stop()¶: Method that stops currently running preload.

unpause()¶: Unpause the preload thread.

class wiki_music.library.parser.preload.WikiCooker(protected_vars: bool)¶

Bases: wiki_music.library.parser.base.ParserBase

Downloads wikipedia page and converts it to WikipediaPage object.

Subsequently, important parts of the page are extracted to class attributes. The class can run a preload of the page in a background thread as the user types input values in the GUI.

Warning

This class is not meant to be instantiated, only inherited.

References

https://www.crummy.com/software/BeautifulSoup/bs4/doc/: used to parse the wikipedia page
https://pypi.org/project/wikipedia/: used to get the wikipedia page

Parameters:	protected_vars (bool) – defines if certain variables should be initialized by __init__ method or not

_page¶

downloaded page to be parsed by BeautifulSoup

Type:	wikipedia.WikipediaPage

_soup¶

BeautifulSoup object representing the whole page

Type:	bs4.BeautifulSoup

_sections¶

the _soup split to page sections indexed by their titles

Type:	Dict[str, List[“Tag”]]

_url¶

the page url or path to pickle file for offline debug

Type:	Union[“Path”, str]

_get_preload_progress() → Generator[str, None, None]¶

Generator, outputs preload progress messages.

Yields:	str – progress messages of preload instance
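
A generator that drains cached progress messages can be sketched as follows. The function `progress_messages` is hypothetical, and it assumes the messages are stored in a standard `queue.Queue`, which is only a guess based on the Queue type of the message attribute.

```python
import queue
from typing import Generator


def progress_messages(messages: "queue.Queue[str]") -> Generator[str, None, None]:
    """Yield all messages currently waiting in the queue without blocking."""
    while True:
        try:
            yield messages.get_nowait()
        except queue.Empty:
            # Nothing left to report for now; end the generator.
            return


q = queue.Queue()
q.put("Searching for page")
q.put("Downloading page")
received = list(progress_messages(q))
```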

_preload_id¶

Get unique id for each preload instance.

The id is a three-tuple of album, band and the offline_debug flag.

Type:	Tuple[str, str, str]

get_wiki() → Optional[str]¶

Wait until preload is finished, then return downloaded data.

Returns:	error string if some error occurred
Return type:	Optional[str]

start_preload()¶

Starts preload instance and caches its reference under unique id.

Other running preloads are paused. Maximum number of preloads is 10: 1 running and 9 paused. If a new preload is added to the cache, the oldest one is destroyed.

stop_preload()¶: Stops all running preloads and delete reference from cache.

terminate(message: str)¶

Send message to GUI to ask the user whether to terminate the app.

If the answer is yes, then the parser is destroyed and the GUI terminated.

See also

wiki_music.gui_lib.main_window.Checkers._exception_check(): this method handles displaying the message to user
wiki_music.utilities.sync.Control: serves to pass message from parser to GUI

Parameters:	message (str) – message to show user when asking if app should terminate

library.parser.process_page¶

Module containing the whole parser with all the inherited subclasses.

Class WikipediaParser has complete functionality but its methods need to be called in the correct order to give sensible results.

class wiki_music.library.parser.process_page.WikipediaParser(protected_vars: bool = True, GUI: bool = False, multi_threaded: bool = True)¶

Bases: wiki_music.library.parser.extractors.DataExtractors, wiki_music.library.parser.preload.WikiCooker, wiki_music.library.parser.in_out.ParserInOut

Class for parsing the wikipedia page and extracting tags data from it.

Warning

Most parser methods are designed to fail gracefully so the extraction can proceed even when some subset of it has failed. This has a dark side because it hides errors! All warning-decorated methods are resilient to any exception defined in wiki_music.utilities.exceptions.
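
The trade-off described in this warning can be illustrated with a small decorator. Everything in the sketch is hypothetical: `ParserError` stands in for the exceptions in wiki_music.utilities.exceptions, and the decorator's exact behaviour in the library (what it logs and what it returns) is an assumption.

```python
import functools
import logging

log = logging.getLogger("parser_sketch")


class ParserError(Exception):
    """Hypothetical base class standing in for the library's exceptions."""


def warning(func):
    """Log a ParserError raised by `func` and return None instead of raising."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except ParserError as exc:
            # The error is downgraded to a log message, which is what hides it.
            log.warning("%s failed: %s", func.__name__, exc)
            return None
    return wrapper


@warning
def get_genres_sketch(page: dict):
    if "genres" not in page:
        raise ParserError("no genres found")
    return page["genres"]
```

A caller of `get_genres_sketch` never sees the exception, so a missing result must be detected by checking for None or by reading the log.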

References

https://www.crummy.com/software/BeautifulSoup/bs4/doc/: used to parse the wikipedia page

Parameters:	protected_vars (bool) – defines if certain variables should be initialized by the __init__ method or not
	GUI (bool) – if True, assume the app is running in GUI mode; if False, assume it is running in CLI mode
	multi_threaded (bool) – whether to run some parts of the code in threads

NLTK_names¶

Use nltk to extract person names from sections of wikipedia page.

See also

wiki_music.constants.parser_const.PERSONNEL_SECTIONS: these sections of the page + track_listing are passed to nltk to look for names

Raises:	`wiki_music.utilities.exceptions.NoNames2ExtractException` – when the page doesn’t have any of the sections defined in the See also section, so no text can be provided to nltk. It makes no sense to try extraction from other parts of the page as they are too cluttered with information that is hard to classify.

_complete()¶

Recursively complete information in parser lists.

Traverses _composers, artists and _personnel and checks each name against every other; if a name is found to be incomplete, it is replaced by the longer version from another list.
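
One way to picture the name completion is a single pass that replaces a short name with a longer known name containing all of its words. The function `complete_names` and its matching rule are assumptions made for illustration; the library's recursive logic may differ.

```python
from typing import List


def complete_names(names: List[str], known: List[str]) -> List[str]:
    """Replace each name by the longest known name that contains all its words."""
    completed = []
    for name in names:
        words = set(name.lower().split())
        # Candidates are longer names that contain every word of `name`.
        candidates = [k for k in known
                      if words and len(k) > len(name)
                      and words <= set(k.lower().split())]
        completed.append(max(candidates, key=len) if candidates else name)
    return completed


result = complete_names(["Smith", "Mary Jones"],
                        ["John Smith", "Mary Jones", "Jones"])
```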

_info_tracks()¶

Parse track names for additional information.

Such as artist, composer, type, etc. Also gets rid of useless strings like bonus track, featuring, etc. This information is assumed to be enclosed in brackets behind the track name.
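
Separating a track name from the bracketed notes behind it can be sketched with regular expressions. The function `split_bracketed` and the tuple `USELESS` are hypothetical; the strings the library actually discards are not listed here.

```python
import re
from typing import List, Tuple

USELESS = ("bonus track", "featuring")  # assumed examples of strings to discard


def split_bracketed(title: str) -> Tuple[str, List[str]]:
    """Separate a track name from the bracketed notes that follow it."""
    # Capture the text inside every pair of round brackets.
    notes = re.findall(r"\(([^()]*)\)", title)
    # Remove the bracketed parts, including the whitespace before them.
    name = re.sub(r"\s*\([^()]*\)", "", title).strip()
    kept = [n.strip() for n in notes if n.strip().lower() not in USELESS]
    return name, kept


name, notes = split_bracketed("Example Song (Instrumental) (bonus track)")
```

The surviving notes could then be matched against known artists, composers or track types.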

_merge_artist_personnel()¶

Assigns personnel to track artists.

The assignment is done based on _appearences, which specifies the tracks for each person in personnel.
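
A minimal sketch of this assignment, assuming that the appearances hold zero-based track indices. The function `merge_personnel` is hypothetical and works on plain lists rather than on parser attributes.

```python
from typing import List


def merge_personnel(artists: List[List[str]], personnel: List[str],
                    appearances: List[List[int]]) -> List[List[str]]:
    """Add each person to the artist list of every track they appear on."""
    merged = [list(track) for track in artists]  # do not mutate the input
    for person, tracks in zip(personnel, appearances):
        for index in tracks:
            # Skip out-of-range indices and avoid duplicate names.
            if 0 <= index < len(merged) and person not in merged[index]:
                merged[index].append(person)
    return merged


merged = merge_personnel([["A"], [], ["B"]], ["B", "C"], [[0, 2], [1]])
```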

_process_tracks(data: List[List[List[str]]]) → Tuple[List[str], List[List[str]]]¶

Process raw extracted list of album CD tracklists for track details.

Parameters:	data (List[List[List[str]]]) – list of tracklists, one for each CD; each tracklist consists of lists representing rows, and each row has cells
Returns:	list of tracks and for each track list of artists
Return type:	Tuple[List[str], List[List[str]]]

get_composers() → List[List[str]]¶

Extract composers from wikipedia page.

Employs complex logic. First, Person named entities are extracted by nltk and merged with composers. This list of names is then used to try to guess composers and corresponding tracks from the short text above the table.

See also

get_personnel(): this method should run first because it populates the _personnel used by this method

Warning

This method is not as robust as it should be. It fails for many types of formatting.

Returns:	list of composers for every track
Return type:	List[List[str]]

get_contents() → List[str]¶

Extract page contents from keys in _sections dictionary.

Raises:	`wiki_music.utilities.exceptions.NoContentsException` – if no contents were retrieved
Returns:	page contents as a list
Return type:	List[str]

get_cover_art(in_thread: bool = False) → Optional[bytes]¶

Get album cover art.

Extracts from the information box in the top right corner of the wikipedia page. For app use it runs in a separate thread because the cover art data is not used by the parser in any way, so it can be downloaded in the background. Populates COVERART.

Parameters:	in_thread (bool) – if false, doesn’t run in thread and blocks until cover art is found
Raises:	`wiki_music.utilities.exceptions.NoCoverArtException` – if cover art url could not be found

get_genres() → List[str]¶

Get list of album genres.

Extracts from the information box in the top right corner of the wikipedia page. If only one genre is found, its value is assigned to GENRE.

Raises:	`wiki_music.utilities.exceptions.NoGenreException` – if no genres could be extracted from page
Returns:	list of found genres
Return type:	List[str]

get_personnel() → Tuple[List[str], List[List[int]]]¶

Extract personnel from wikipedia page.

Extraction is done from the following sections: wiki_music.constants.parser_const.PERSONNEL_SECTIONS. These entries are then parsed for additional data like appearances on tracks.

Raises:	`wiki_music.utilities.exceptions.NoPersonnelException` – if no table or list with personnel was found
Returns:	two lists, first contains found personnel and the second has for each person list of tracks on which the person appeared
Return type:	Tuple[List[str], List[List[int]]]

get_release_date() → str¶

Get album release date.

Extracts from the information box in the top right corner of the wikipedia page. Populates DATE.

Raises:	`utilities.exceptions.NoReleaseDateException` – raised if no release date was extracted
Returns:	release year as a string
Return type:	str

get_tracks() → Tuple[List[str], List[List[str]]]¶

Attempt to extract tracklist from html table or list on wikipedia.

See also

_from_table(): used to parse tracklist in html table
_from_list(): used to parse tracklist in html list
_process_tracks(): method called to parse raw extracted table and get song numbers, artists, composers …

Raises:	`wiki_music.utilities.exceptions.NoTracklistException` – raised if no tracklist in any format was found
Returns:	list of tracks and for each track list of artists
Return type:	Tuple[List[str], List[List[str]]]

merge_artist_composers()¶

Move all artists to composers list.

This is done or left out based on user input