wiki_music.library.parser module¶
Warning
Documentation is still under construction; some things might not be up to date.
library.parser.WikipediaRunner¶
Top-level Wikipedia parser class.
Inherits all other parser subclasses. This is the class that is intended for user interaction. Its methods know how to run the parser in order to produce meaningful results.
Warning
This is the only parser class that is meant for user interaction. Calling its subclasses directly might result in unexpected behaviour.
Parameters: - album (str) – album name
- albumartist (str) – band name
- work_dir (str) – directory with music files
- with_log (bool) – if the parser should output its progress to the logger, only for CLI mode
- GUI (bool) – if True, assume the app is running in GUI mode; if False, assume the app is running in CLI mode
- protected_vars (bool) – whether to initialize protected variables or not
- multi_threaded (bool) – whether to run some parts of the code in threads
-
wiki_music.library.parser.WikipediaRunner.
ALBUM
¶ String with album name.
This attribute can be protected from resetting in the __init__ method.
Type: str
-
wiki_music.library.parser.WikipediaRunner.
ALBUMARTIST
¶ String with band name.
This attribute can be protected from resetting in the __init__ method.
Type: str
-
wiki_music.library.parser.WikipediaRunner.
ARTIST
¶ Each entry in list holds list of artists for one track.
Type: List[List[str]]
-
wiki_music.library.parser.WikipediaRunner.
COMPOSER
¶ Each entry in list holds list of composers for one track.
Type: List[List[str]]
-
wiki_music.library.parser.WikipediaRunner.
COVERART
¶ Holds cover art read into memory as a bytes object.
Type: bytes
-
wiki_music.library.parser.WikipediaRunner.
DATE
¶ Album release date.
Type: str
-
wiki_music.library.parser.WikipediaRunner.
DISCNUMBER
¶ List containing the disc number for every track.
Type: List[int]
-
wiki_music.library.parser.WikipediaRunner.
FILE
¶ List of files on local disk corresponding to each track.
Type: List[str]
-
wiki_music.library.parser.WikipediaRunner.
GENRE
¶ Holds the album genre selected automatically or by user.
If
genres
is a list with one item, then it is that item; otherwise user input is required to select from the genres list.
Type: str
-
wiki_music.library.parser.WikipediaRunner.
LYRICS
¶ List of lyrics corresponding to each track.
Type: List[str]
-
wiki_music.library.parser.WikipediaRunner.
NLTK_names
¶ Use nltk to extract person names from sections of wikipedia page.
See also
wiki_music.constants.parser_const.PERSONNEL_SECTIONS
- these sections of the page + track_listing are passed to nltk to look for names
Raises: wiki_music.utilities.exceptions.NoNames2ExtractException
– when the page doesn’t have any of the sections listed under See also, so no text can be provided to nltk. It makes no sense to try extraction from other parts of the page, as they are too cluttered with information that is hard to classify.
-
wiki_music.library.parser.WikipediaRunner.
TITLE
¶ List of track names.
Type: List[str]
-
wiki_music.library.parser.WikipediaRunner.
TRACKNUMBER
¶ List of numbers for each track.
Type: List[str]
-
wiki_music.library.parser.WikipediaRunner.
TYPE
¶ List of track types.
Type: List[str]
-
wiki_music.library.parser.WikipediaRunner.
_preload_id
¶ Get unique id for each preload instance.
Id is a three-tuple of album, band and offline_debug flag.
Type: Tuple[str, str, str]
-
wiki_music.library.parser.WikipediaRunner.
bracketed_types
¶ Takes _bracketed_types list, populates and returns it.
See also
wiki_music.utilities.parser_utils.bracket()
- function used to append brackets at the ends of strings in list
type: List[str]
-
wiki_music.library.parser.WikipediaRunner.
debug_folder
¶ Path to debugging folder.
Type: str
-
wiki_music.library.parser.WikipediaRunner.
files
¶ Gets list of music files in currently set working directory.
Type: List[Path] See also
_reassign_files()
library.parser.base¶
Base module for all parser classes.
-
class
wiki_music.library.parser.base.
ParserBase
(protected_vars: bool)¶ Bases:
object
The base class for all
wiki_music.parser
subclasses. Defines the necessary attributes. The uppercased attributes correspond to tag names for easier access.
Warning
This class is not meant to be instantiated, only inherited.
Note
Uppercased properties correspond to tag names, so they can easily be accessed with getattr and setattr.
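The note above can be illustrated with a minimal sketch. `FakeParser`, `TAGS` and `collect_tags` are hypothetical stand-ins and not part of wiki_music; only the uppercase attribute names are taken from this page.

```python
from typing import Dict, List


class FakeParser:
    """Hypothetical stand-in exposing a few uppercase tag attributes."""

    def __init__(self) -> None:
        self.ALBUM = "Some Album"
        self.TITLE: List[str] = ["Intro", "Second Song"]
        self.TRACKNUMBER: List[str] = ["1", "2"]


# tag names double as attribute names, so one loop covers every tag
TAGS = ("ALBUM", "TITLE", "TRACKNUMBER")


def collect_tags(parser: object) -> Dict[str, object]:
    """Read every tag from the parser by its name."""
    return {tag: getattr(parser, tag) for tag in TAGS}


parser = FakeParser()
setattr(parser, "ALBUM", "Renamed Album")  # write a tag by name
tags = collect_tags(parser)
```

Because the attribute name and the tag name are the same string, no per-tag accessor code is needed.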
-
offline_debug
¶ determines if app will run in offline debug mode
Type: bool
-
write_json
¶ determines if the tracklist will be output in JSON format
Type: bool
-
multi_threaded
¶ whether to run parts of the code in parallel
Type: bool
-
_contents
¶ stores the wikipedia page contents
Type: List[str]
-
_disk_sep
¶ list of track indices separating disks, e.g. if CD 1 = (1, 13) and CD 2 = (14, 20), then _disk_sep = [0, 12, 19]; the offset by one is because indexing starts at zero
Type: List[int]
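One possible reading of the `_disk_sep` example is sketched below. It assumes that the entries after the leading 0 are the zero-based index of the last track on each disc. The helper `disc_numbers` is hypothetical, and the library may derive DISCNUMBER differently.

```python
from bisect import bisect_left
from typing import List


def disc_numbers(disk_sep: List[int]) -> List[int]:
    """Return a 1-based disc number for every zero-based track index."""
    last_indices = disk_sep[1:]  # assumed: last track index of each disc
    n_tracks = last_indices[-1] + 1
    # a track belongs to the first disc whose last index is >= its own index
    return [bisect_left(last_indices, i) + 1 for i in range(n_tracks)]


numbers = disc_numbers([0, 12, 19])
```

With the separator list from the example, tracks at indices 0 to 12 land on disc 1 and those at indices 13 to 19 on disc 2.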
-
_disks
¶ holds album disks titles
Type: List[list]
-
genres
¶ list of genres found in wikipedia page
Type: List[str]
-
_header
¶ tracklist table headers
Type: List[str]
-
NLTK_names
¶ list of person named entities extracted from the wikipedia page by nltk. See
wiki_music.library.parser.process_page.WikipediaParser.NLTK_names
for details on how and from which parts of the text the names are extracted.
Type: List[str]
-
_personnel
¶ list holding additional personnel participating on the album
Type: List[str]
-
_appearences
¶ list corresponding to personnel, holding for each person the list of tracks on which that person appeared
Type: List[List[int]]
-
_subtracks
¶ each entry holds list of subtracks for one track
Type: List[List[str]]
-
_subtypes
¶ each entry holds list of types for each subtrack
Type: List[List[str]]
-
work_dir
¶ string with path to the directory with music files; this variable can be protected from resetting in the __init__ method
Type: Path
-
log
¶ instance of MultiLog which sends messages to logger and GUI
Type: wiki_music.utilities.utils.MultiLog
-
_sections
¶ dictionary of lists of BeautifulSoup objects; each entry in the dict contains one whole section of the page and is indexed by that section's title
Type: Dict[str, List[Bs4Soup]]
-
ALBUM
¶ String with album name.
This attribute can be protected from resetting in the __init__ method.
Type: str
-
ALBUMARTIST
¶ String with band name.
This attribute can be protected from resetting in the __init__ method.
Type: str
-
ARTIST
¶ Each entry in list holds list of artists for one track.
Type: List[List[str]]
-
COMPOSER
¶ Each entry in list holds list of composers for one track.
Type: List[List[str]]
-
COVERART
¶ Holds cover art read into memory as a bytes object.
Type: bytes
-
DATE
¶ Album release date.
Type: str
-
DISCNUMBER
¶ List containing the disc number for every track.
Type: List[int]
-
FILE
¶ List of files on local disk corresponding to each track.
Type: List[str]
-
GENRE
¶ Holds the album genre selected automatically or by user.
If
genres
is a list with one item, then it is that item; otherwise user input is required to select from the genres list.
Type: str
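The selection rule for GENRE can be sketched as follows. `pick_genre` is a hypothetical helper, and returning None to signal that the user has to choose is an assumption made for this illustration.

```python
from typing import List, Optional


def pick_genre(genres: List[str]) -> Optional[str]:
    """Return the genre if it is unambiguous, otherwise None."""
    if len(genres) == 1:
        return genres[0]
    return None  # assumed signal: the caller has to ask the user


single = pick_genre(["Power metal"])
ambiguous = pick_genre(["Power metal", "Speed metal"])
```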
-
LYRICS
¶ List of lyrics corresponding to each track.
Type: List[str]
-
TITLE
¶ List of track names.
Type: List[str]
-
TRACKNUMBER
¶ List of numbers for each track.
Type: List[str]
-
TYPE
¶ List of track types.
Type: List[str]
-
reinit
(protected_vars: bool)¶ Reinitializes parser variables.
-
library.parser.extractors¶
Holds information extractors for various html formats.
-
class
wiki_music.library.parser.extractors.
DataExtractors
¶ Bases:
object
Parse various table formats from wikipedia.
Warning
This class is not meant to be instantiated, only inherited.
-
classmethod
_from_list
(table: BeautifulSoup) → List[List[List[str]]]¶ Extract tracklist formatted as an html list with ‘ol’ and ‘ul’ tags.
Parameters: table (BeautifulSoup) – html list containing the tracklist
Returns: 2D array representing table with rows and columns
Return type: List[List[str]]
-
static
_from_table
(tables: List[BeautifulSoup]) → List[List[List[str]]]¶ Extract wikipedia html table composed of ‘td’ and ‘th’ html tags.
Parameters: tables (List[bs4.BeautifulSoup]) – each BeautifulSoup in the list contains one html table
Returns: each element in the list is one parsed table; each table is a 2D array of strings representing rows and columns
Return type: List[List[List[str]]]
-
static
_fuzzy_extract
(string: str, choices: List[str], limit: Optional[int] = None) → List[str]¶ Fuzzy extract names, track types, etc. from brackets behind the track name.
Parameters: - string (str) – string to match
- choices (List[str]) – list of possible choices for string to match
- limit (int) – max number of extracted choices
Returns: list of extracted choices
Return type: List[str]
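A rough sketch of this kind of fuzzy extraction is given below. It uses `difflib.get_close_matches` from the standard library as a stand-in; this page does not say which matcher wiki_music actually uses. The function name `fuzzy_extract` and the 0.8 cutoff are assumptions chosen for illustration.

```python
import re
from difflib import get_close_matches
from typing import List, Optional


def fuzzy_extract(string: str, choices: List[str],
                  limit: Optional[int] = None) -> List[str]:
    """Return choices that approximately match text found in brackets."""
    found: List[str] = []
    # look only at the text enclosed in round brackets
    for chunk in re.findall(r"\(([^)]*)\)", string):
        matches = get_close_matches(chunk, choices,
                                    n=len(choices) or 1, cutoff=0.8)
        found.extend(m for m in matches if m not in found)
    return found[:limit]  # limit=None keeps every match


types = fuzzy_extract("Valhalla (Instrumetal)", ["Instrumental", "Acoustic"])
```

The misspelled bracket content is still mapped to the closest known choice, while unrelated choices fall below the cutoff.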
-
static
_get_artist
(cell: str) → List[str]¶ Splits list of artists in tracklist table cell separated by , or &.
Parameters: cell (str) – string containing artists separated by delimiters
Returns: list of artists
Return type: List[str]
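Splitting a cell on "," or "&" can be sketched with a regular expression. `split_artists` is a hypothetical name, and the real method may treat additional delimiters or edge cases differently.

```python
import re
from typing import List


def split_artists(cell: str) -> List[str]:
    """Split a table cell on ',' or '&' and drop surrounding whitespace."""
    parts = re.split(r"[,&]", cell)
    return [p.strip() for p in parts if p.strip()]


artists = split_artists("Hansi Kürsch, André Olbrich & Marcus Siepen")
```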
-
static
_get_track
(cell: str) → Tuple[str, List[str]]¶ Extract track and subtracks names from table cell.
Parameters: cell (str) – table cell containing track and possibly subtrack names
Returns: first element is the track name and the second is a list of subtracks
Return type: tuple
-
static
_html2python_list
(table: BeautifulSoup) → List[str]¶ Converts html list to python list.
Html list can be ordered <ol> or unordered <ul>; its elements should be separated by <li> tags.
Parameters: table (BeautifulSoup) – html object parsed by bs4
Returns: each element represents one row in the html list
Return type: list
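The conversion can be sketched without third-party packages. The example below swaps bs4 for the standard library `html.parser` module so that it stays self-contained; `ListItems` and `html_list_to_python` are hypothetical names, and the real method operates on a BeautifulSoup object instead of a raw string.

```python
from html.parser import HTMLParser
from typing import List


class ListItems(HTMLParser):
    """Collect the text of every <li> element into a list of rows."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[str] = []
        self._depth = 0  # > 0 while inside an <li> element

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._depth += 1
            self.rows.append("")  # start a new row

    def handle_endtag(self, tag):
        if tag == "li" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.rows[-1] += data  # text of nested tags joins the row


def html_list_to_python(html: str) -> List[str]:
    parser = ListItems()
    parser.feed(html)
    parser.close()
    return [row.strip() for row in parser.rows]


rows = html_list_to_python("<ol><li>Intro 1:05</li><li><i>Second</i> Song</li></ol>")
```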
-
library.parser.in_out¶
Module with parser input-output methods.
-
class
wiki_music.library.parser.in_out.
ParserInOut
(protected_vars)¶ Bases:
wiki_music.library.parser.base.ParserBase
Encapsulates parser input and output methods.
Class is inherited by
wiki_music.library.parser.process_page.WikipediaParser
. Takes care of outputting and loading information.
-
_reassign_files
()¶ Search current working directory and assign files to tracks.
-
basic_out
()¶ Outputs files in three basic formats.
- pickled version of the downloaded wikipedia page
- nicely formatted html version of the wikipedia page
- plain text version of the wikipedia page
-
bracketed_types
¶ Takes _bracketed_types list, populates and returns it.
See also
wiki_music.utilities.parser_utils.bracket()
- function used to append brackets at the ends of strings in list
type: List[str]
-
data_to_dict
(indices: List[int]) → SongList¶ Converts parser data to list of dictionaries.
If json_dump is enabled, the list is written to file.
Parameters: indices (List[int]) – indices of files to save See also
wiki_music.constants.tags.EXTENDED_TAGS
- list of tags that are written to each dictionary
Returns: each dictionary in list represents tags of one song Return type: List[Dict[str, Union[str, int, bytes, list]]]
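The conversion from per-tag lists to per-song dictionaries might look like the sketch below. The `TAGS` tuple is a hypothetical subset (the real list lives in EXTENDED_TAGS), and passing the data in as a plain dict is an assumption made to keep the example self-contained.

```python
from typing import Any, Dict, List

# hypothetical subset of tags; the real list lives in EXTENDED_TAGS
TAGS = ("TITLE", "TRACKNUMBER", "ARTIST")


def data_to_dict(data: Dict[str, List[Any]],
                 indices: List[int]) -> List[Dict[str, Any]]:
    """Build one tag dictionary per selected track."""
    return [{tag: data[tag][i] for tag in TAGS} for i in indices]


data = {
    "TITLE": ["Intro", "Second Song", "Outro"],
    "TRACKNUMBER": ["1", "2", "3"],
    "ARTIST": [["A"], ["A", "B"], ["C"]],
}
songs = data_to_dict(data, [0, 2])
```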
-
debug_folder
¶ Path to debugging folder.
Type: str
-
disk_write
()¶ Save tracklist and personnel to disk in plain text format.
-
files
¶ Gets list of music files in currently set working directory.
Type: List[Path] See also
-
personnel_2_str
()¶ Convert album personnel to string to print out or write to disk.
Returns: nicely formated string representation of personnel Return type: str
-
print_tracklist
()¶ Prints tracklist to console.
See also
-
read_files
()¶ Read tags from files in working directory.
See also
wiki_music.library.tags_io.read_tags()
- function that handles tag reading
-
save_lyrics
(find: bool = True)¶ Calls lyricsfinder to search for and save lyrics for all tracks.
Parameters: find (bool) – if False lyrics list is initialized only with empty strings See also
wiki_music.library.lyrics.save_lyrics()
- function that handles lyrics finding and saving
-
tracklist_2_str
(to_file=True) → list¶ Convert tracklist to string to print out or write to disk.
Parameters: to_file (bool) – if False and the tracklist is to be printed to console, highlight headers to make tracklist more readable
Returns: nicely formatted string representation of tracklist
Return type: str
Write tags to corresponding files. Writing is done in parallel.
Parameters: indices (List[int]) – indices of files to save See also
wiki_music.library.tags_io.write_tags()
- function that handles tag writing
data_to_dict()
- this method prepares tags data in suitable format for writing
wiki_music.utilities.parser_utils.ThreadPool()
- class that handles parallelism
Returns: True if writing was successful
Return type: bool
-
library.parser.preload¶
Module storing the class that takes care of downloading the wikipedia page.
-
class
wiki_music.library.parser.preload.
CircularDict
(maxlen=typing.Union[int, NoneType])¶ Bases:
collections.OrderedDict
Circular dict-like indexable stack with limited capacity.
If a maximum length is specified, the oldest item is discarded when a new one is added to a stack that has already reached its maximum capacity.
Parameters: maxlen (Optional[int]) – maximum stack capacity
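A minimal sketch of such a bounded, ordered mapping is shown below. `BoundedDict` is a hypothetical name, and the actual CircularDict may expose more behaviour (for example positional indexing) than is shown here.

```python
from collections import OrderedDict
from typing import Optional


class BoundedDict(OrderedDict):
    """Ordered dict that drops its oldest item once maxlen is exceeded."""

    def __init__(self, maxlen: Optional[int] = None) -> None:
        super().__init__()
        self.maxlen = maxlen

    def __setitem__(self, key, value) -> None:
        super().__setitem__(key, value)
        if self.maxlen is not None:
            while len(self) > self.maxlen:
                self.popitem(last=False)  # discard the oldest entry


cache = BoundedDict(maxlen=2)
cache["a"] = 1
cache["b"] = 2
cache["c"] = 3  # "a" is evicted here
```

Insertion order is what makes "oldest" well defined, which is why OrderedDict is a natural base class.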
-
class
wiki_music.library.parser.preload.
Preload
(album: str, band: str, offline_debug: bool)¶ Bases:
object
Controls the preload of the wikipedia page.
It is totally self-contained and exposes only start, stop and pause methods. Aborts automatically when no album or band is specified. After preload is complete, results are available in
results
-
message
¶ caches preload progress messages
Type: Queue
-
_check_band
¶ Check if artist from input is the same as the one on wikipedia page.
If the artist is not the same, issues a warning about the mismatch and asks the user whether to continue.
See also
terminate()
- method that takes care of ending the app execution
-
_cook_soup
() → Optional[str]¶ Parse downloaded wikipedia page with bs4 to BeautifulSoup object.
Then splits the page to dictionary of sections, where each section is indexed by its name.
Returns: if some error occurred, return a string with its description
Return type: Optional[str]
-
_from_disk
() → Optional[str]¶ Load wikipedia page from pickle file on disk.
Returns: if some error occurred, return a string with its description
Return type: Optional[str]
-
_from_web
() → Optional[str]¶ Guesses the right wikipedia page from input and downloads it.
Returns: if some error occurred, return a string with its description
Return type: Optional[str]
-
_preload_run
()¶ Organizes the preload thread and calls other methods.
Based on input decides how to load and parse the wikipedia page.
See also
_from_web()
- method to retrieve wikipedia page from internet
_from_disk()
- method to retrieve pickled wikipedia page from disk
_cook_soup()
- method to parse the page
-
pause
()¶ Pause the preload thread.
-
results
¶ Returns downloaded and preprocessed wikipedia page.
Waits until results are available, only then returns.
Returns: - WikipediaPage – wikipedia page object
- bs4.BeautifulSoup – html parsed tree
- Dict[List[bs4.element.Tags]] – sections of the page split into dict indexed by section names
- Union[str, Path] – url address of the page or Path to offline pickle file
- str – error string, or None if no exceptions occurred
-
stop
()¶ Method that stops currently running preload.
-
unpause
()¶ Unpause the preload thread.
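The stop, pause and unpause controls can be sketched with `threading.Event` objects. `PausableWorker` is a hypothetical class written for this illustration; the page does not describe how Preload implements these methods internally.

```python
import threading
from typing import List


class PausableWorker:
    """Hypothetical worker thread with stop, pause and unpause controls."""

    def __init__(self, steps: int) -> None:
        self.done: List[int] = []
        self._steps = steps
        self._running = threading.Event()  # set means "not paused"
        self._running.set()
        self._stopped = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def pause(self) -> None:
        self._running.clear()

    def unpause(self) -> None:
        self._running.set()

    def stop(self) -> None:
        self._stopped.set()
        self._running.set()  # wake the thread so it can exit

    def join(self) -> None:
        self._thread.join()

    def _run(self) -> None:
        for step in range(self._steps):
            self._running.wait()  # blocks while paused
            if self._stopped.is_set():
                return
            self.done.append(step)


worker = PausableWorker(steps=3)
worker.pause()  # paused before start, so no step can run yet
worker.start()
paused_snapshot = list(worker.done)
worker.unpause()
worker.join()
```

Using events avoids busy waiting: a paused thread sleeps inside `wait()` until it is unpaused or stopped.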
-
-
class
wiki_music.library.parser.preload.
WikiCooker
(protected_vars: bool)¶ Bases:
wiki_music.library.parser.base.ParserBase
Downloads the wikipedia page and converts it to a WikipediaPage object.
Subsequently, important parts of the page are extracted to class attributes. The class can run the preload of the page in a background thread while the user types input values in the GUI.
Warning
This class is not meant to be instantiated, only inherited.
References
https://www.crummy.com/software/BeautifulSoup/bs4/doc/: used to parse the wikipedia page
https://pypi.org/project/wikipedia/: used to get the wikipedia page
Parameters: protected_vars (bool) – defines if certain variables should be initialized by __init__ method or not
-
_page
¶ downloaded page to be parsed by BeautifulSoup
Type: wikipedia.WikipediaPage
-
_soup
¶ BeautifulSoup object representing the whole page
Type: bs4.BeautifulSoup
-
_url
¶ the page url or path to pickle file for offline debug
Type: Union[“Path”, str]
-
_get_preload_progress
() → Generator[str, None, None]¶ Generator, outputs preload progress messages.
Yields: str – progress messages of preload instance
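A generator that relays queued progress messages could look like the sketch below. `drain_messages` is a hypothetical function, and stopping once the queue is empty is an assumption; the real generator may instead keep waiting until the preload has finished.

```python
from queue import Empty, Queue
from typing import Generator


def drain_messages(messages: "Queue[str]") -> Generator[str, None, None]:
    """Yield queued progress messages until the queue is empty."""
    while True:
        try:
            yield messages.get_nowait()
        except Empty:
            return


progress: "Queue[str]" = Queue()
progress.put("Downloading page")
progress.put("Cooking soup")
received = list(drain_messages(progress))
```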
-
_preload_id
¶ Get unique id for each preload instance.
Id is a three-tuple of album, band and offline_debug flag.
Type: Tuple[str, str, str]
-
get_wiki
() → Optional[str]¶ Wait until preload is finished, then return downloaded data.
Returns: error string if some error occurred
Return type: Optional[str]
-
start_preload
()¶ Starts preload instance and caches its reference under unique id.
Other running preloads are paused. Maximum number of preloads is 10: 1 running and 9 paused. If a new preload is added to the cache, the oldest one is destroyed.
-
stop_preload
()¶ Stops all running preloads and deletes their references from the cache.
-
terminate
(message: str)¶ Send message to GUI to ask the user whether to terminate the app.
If the answer is yes, then the parser is destroyed and the GUI terminated.
See also
wiki_music.gui_lib.main_window.Checkers._exception_check()
- this method handles displaying the message to user
wiki_music.utilities.sync.Control
- serves to pass message from parser to GUI
Parameters: message (str) – message to show user when asking if app should terminate
-
library.parser.process_page¶
Module containing the whole parser with all the inherited subclasses.
Class WikipediaParser
has complete functionality but its methods need
to be called in the correct order to give sensible results.
-
class
wiki_music.library.parser.process_page.
WikipediaParser
(protected_vars: bool = True, GUI: bool = False, multi_threaded: bool = True)¶ Bases:
wiki_music.library.parser.extractors.DataExtractors
,wiki_music.library.parser.preload.WikiCooker
,wiki_music.library.parser.in_out.ParserInOut
Class for parsing the wikipedia page and extracting tags data from it.
Warning
Most parser methods are designed to fail gracefully, so the extraction can proceed even when some subset of it fails. This has a dark side because it hides errors! All warning-decorated methods are resilient to any exception defined in
wiki_music.utilities.exceptions
.
References
https://www.crummy.com/software/BeautifulSoup/bs4/doc/: used to parse the wikipedia page
Parameters: - protected_vars (bool) – defines if certain variables should be initialized by __init__ method or not
- GUI (bool) – if True, assume the app is running in GUI mode; if False, assume the app is running in CLI mode
- multi_threaded (bool) – whether to run some parts of code in threads
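The graceful-failure behaviour mentioned in the warning for this class can be sketched as a decorator. Everything in the example is hypothetical: the decorator name `warning`, the `NoGenreError` class and the logger name are stand-ins, and the real decorator may log or report errors differently.

```python
import functools
import logging
from typing import Any, Callable, Optional

log = logging.getLogger("parser_sketch")


class NoGenreError(Exception):
    """Hypothetical stand-in for an exception from the exceptions module."""


def warning(func: Callable[..., Any]) -> Callable[..., Optional[Any]]:
    """Log known parser errors instead of letting them propagate."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Optional[Any]:
        try:
            return func(*args, **kwargs)
        except NoGenreError as e:
            log.warning("%s failed: %s", func.__name__, e)
            return None  # the error is hidden from the caller

    return wrapper


@warning
def get_genres(found: list) -> list:
    if not found:
        raise NoGenreError("no genres could be extracted")
    return found


ok = get_genres(["Rock"])
failed = get_genres([])
```

The second call returns None instead of raising, which shows both the benefit (extraction continues) and the cost (the caller never sees the exception).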
-
NLTK_names
¶ Use nltk to extract person names from sections of wikipedia page.
See also
wiki_music.constants.parser_const.PERSONNEL_SECTIONS
- these sections of the page + track_listing are passed to nltk to look for names
Raises: wiki_music.utilities.exceptions.NoNames2ExtractException
– when the page doesn’t have any of the sections listed under See also, so no text can be provided to nltk. It makes no sense to try extraction from other parts of the page, as they are too cluttered with information that is hard to classify.
-
_complete
()¶ Recursively complete information in parser lists.
Traverses:
_composers
, artists
and _personnel
and checks every name against every other; if a name is found to be incomplete, it is replaced by the longer version from another list.
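The idea of completing shortened names can be sketched as below. This is a simplified single pass over flat lists, not the recursive procedure of the real method; `complete_names` is a hypothetical helper, and matching on whole words is an assumption.

```python
from typing import List


def complete_names(*name_lists: List[str]) -> None:
    """Replace shortened names in place by the longest matching full name."""
    every_name = {name for names in name_lists for name in names}
    for names in name_lists:
        for i, name in enumerate(names):
            # candidates are longer names that contain this one as a whole word
            longer = [n for n in every_name
                      if len(n) > len(name) and name in n.split()]
            if longer:
                names[i] = max(longer, key=len)


composers = ["Kürsch", "André Olbrich"]
artists = ["Hansi Kürsch", "Olbrich"]
complete_names(composers, artists)
```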
-
_info_tracks
()¶ Parse track names for additional information.
Such as artist, composer, type, etc. Also gets rid of useless strings like bonus track, featuring, etc. This information is assumed to be enclosed in brackets behind the track name.
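Separating the bracketed notes from the bare track name can be sketched with a regular expression. `split_track_info` and the `BRACKETS` pattern are hypothetical; the real method goes on to classify the notes as artists, composers or types, which is not shown here.

```python
import re
from typing import List, Tuple

# optional whitespace followed by one pair of round brackets
BRACKETS = re.compile(r"\s*\(([^)]*)\)")


def split_track_info(track: str) -> Tuple[str, List[str]]:
    """Separate the bare track name from the bracketed notes behind it."""
    notes = [note.strip() for note in BRACKETS.findall(track)]
    name = BRACKETS.sub("", track).strip()
    return name, notes


name, notes = split_track_info("Mirror Mirror (instrumental) (bonus track)")
```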
-
_merge_artist_personnel
()¶ Assigns personnel to track artists.
The assignment is done based on
_appearences
which specifies tracks for each person in personnel.
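The assignment can be sketched as below, assuming that the appearance lists hold zero-based track indices (the page does not state whether they are zero-based or one-based). `merge_personnel` is a hypothetical free function standing in for the method.

```python
from typing import List


def merge_personnel(artists: List[List[str]], personnel: List[str],
                    appearances: List[List[int]]) -> None:
    """Append each person to the artists of the tracks they appeared on."""
    for person, tracks in zip(personnel, appearances):
        for track in tracks:  # assumed: zero-based track indices
            if person not in artists[track]:
                artists[track].append(person)


artists = [["Band"], ["Band"], ["Band"]]
merge_personnel(artists, ["Guest Singer"], [[0, 2]])
```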
-
_process_tracks
(data: List[List[List[str]]]) → Tuple[List[str], List[List[str]]]¶ Process raw extracted list of album CD tracklists for track details.
Parameters: data (List[List[List[str]]]) – list of tracklists, one for each CD; each tracklist consists of lists representing rows, and each row has cells
Returns: list of tracks and, for each track, a list of artists
Return type: Tuple[List[str], List[List[str]]]
-
get_composers
() → List[List[str]]¶ Extract composers from wikipedia page.
Employs complex logic. First, person named entities are extracted by nltk and merged with composers. This list of names is then used to try to guess composers and corresponding tracks from the short text above the table.
See also
get_personnel()
- this method should run first because it populates the
_personnel
used by this method
Warning
This method is not as robust as it should be. It fails for many types of formatting.
Returns: list of composers for every track
Return type: List[List[str]]
-
get_contents
() → List[str]¶ Extract page contents from keys in
_sections
dictionary.
Raises: wiki_music.utilities.exceptions.NoContentsException
– if no contents were retrieved
Returns: page contents as a list
Return type: List[str]
-
get_cover_art
(in_thread: bool = False) → Optional[bytes]¶ Get album cover art.
Extracts from information box in the top right corner of wikipedia page. For app use it runs in a separate thread because the cover art data is not used by parser in any way, so it can be downloaded in the background. Populates
COVERART
Parameters: in_thread (bool) – if False, doesn’t run in a thread and blocks until cover art is found
Raises: wiki_music.utilities.exceptions.NoCoverArtException
– if cover art url could not be found
-
get_genres
() → List[str]¶ Get list of album genres.
Extracts from information box in the top right corner of wikipedia page. If only one genre is found, its value is assigned to
GENRE
Raises: wiki_music.utilities.exceptions.NoGenreException
– if no genres could be extracted from page
Returns: list of found genres
Return type: List[str]
-
get_personnel
() → Tuple[List[str], List[List[int]]]¶ Extract personnel from wikipedia page.
Extraction is done from the following sections:
wiki_music.constants.parser_const.PERSONNEL_SECTIONS
and these entries are then parsed for additional data, such as appearances on tracks.
Raises: wiki_music.utilities.exceptions.NoPersonnelException
– if no table or list with personnel was found
Returns: two lists; the first contains found personnel and the second has, for each person, the list of tracks on which the person appeared
Return type: Tuple[List[str], List[List[int]]]
-
get_release_date
() → str¶ Get album release date.
Extracts from information box in the top right corner of wikipedia page. Populates wiki_music.DATE
Raises: utilities.exceptions.NoReleaseDateException
– raised if no release date was extracted
Returns: release year as a string
Return type: str
-
get_tracks
() → Tuple[List[str], List[List[str]]]¶ Attempt to extract tracklist from html table or list on wikipedia.
See also
_from_table()
- used to parse tracklist in html table
_from_list()
- used to parse tracklist in html list
_process_tracks()
- method called to parse raw extracted table and get song numbers, artists, composers …
Raises: wiki_music.utilities.exceptions.NoTracklistException
– raised if no tracklist in any format was found
Returns: list of tracks and, for each track, a list of artists
Return type: Tuple[List[str], List[List[str]]]
-
merge_artist_composers
()¶ Move all artists to composers list.
This is done or left out based on user input